FOREWORD


This is Volume I of a group of three containing work on automatic language
processing. This work involved both theoretical and experimental
investigations of natural language characteristics as well as the properties
of actual and laboratory-modified document collections and retrieval
situations. Titles of the volumes are:

Volume  I  Selected  Collection  Statistics  and  Data  Analyses 

Volume  II  Linear  Models  for  Associative  Retrieval 
Volume  III  Development  of  String  Indexing  Techniques 

This work was conducted in support of Project 2806, Task 280601 by
Arthur D. Little, Inc., 35 Acorn Park, Cambridge, Massachusetts under
Contract AF 19(628)-3311. Our internal code for this contract is
C-65850. The work was also supported in part under Contract
AF 19(628)-4067.

The program was monitored for the U.S. Air Force by John B. Goodenough,
ESVPD, and was principally performed during the period January 1965 to
December 1966. The draft report was submitted on 15 January 1967.

We  wish  to  acknowledge  the  important  contribution  of  Philip  Hankins, Inc. 
in  performing  part  of  the  computer  programming  under  subcontract  to  us. 

The cooperation between the Science and Technology Information Division
of NASA and the Decision Sciences Laboratory has been of immense value
to the research reported here. In particular, programs developed under
our Contract NASW-1051 with NASA have been used in some of the
investigations reported here, and conversely.

This  Technical  Report  has  been  reviewed  and  approved. 


Project Officer


ABSTRACT


As part of a research program aimed at determining the parameters
influencing the effectiveness of a message retrieval system, a
collection of 10,000 technical abstracts was indexed and retrieval
experiments were conducted with them. Since part of the work involved
the development and test operation of an associative retrieval system,
basic data about the distribution of words and word strings were
gathered in preparing the system for test and trial. These statistics
were thought to be of possible interest to other workers in the field
and are gathered as a series of loosely connected papers in this volume
under the following groupings: Characteristics and Indexing of GE Data
Base; Comparison of Manual and Machine Selected Vocabularies; Vocabulary
Distribution Studies; and Studies of Content Bearing Units in Text.


PREFACE 


In the course of five years of work on automatic language processing,
which involved both theoretical and experimental investigations of
natural language characteristics and the properties of actual and
laboratory-modified document collections and retrieval situations, we
have written a number of technical papers, working papers, and internal
notes whose possibly useful content has by no means been completely
revealed in the course of publishing a series of Project reports.


We feel it desirable to make this material available to other workers
in the field, partly to discharge the normal obligations to publish and
partly with the hope of stimulating further work in a new, difficult,
and potentially very rewarding field. We do not, however, feel
obligated to impose an artificial structure on this work, which is
essentially supportive in nature. Therefore, this series of volumes is
a collection of papers, loosely grouped into areas but not otherwise
intended to demonstrate coherence, some of which are in support of, and
others peripheral to, our published reports.

This  is  Volume  I  of  a  group  of  three  whose  titles  are: 


VOLUME I    Selected Collection Statistics and Data Analyses
VOLUME II   Linear Models for Associative Retrieval
VOLUME III  Development of String Indexing Techniques


SELECTED COLLECTION STATISTICS AND DATA ANALYSES
INTRODUCTION 

As part of a research program aimed at determining the parameters
influencing the effectiveness of a message retrieval system, we
automatically indexed and conducted retrieval experiments on a
collection of 10,000 technical abstracts. Each of these abstracts was
regarded as an assertive message in its own right, i.e., an
information-bearing unit not necessarily related to other messages in
the set.

Part  of  our  work  involved  the  development  and  test  operation  of 
an  associative  retrieval  system  to  operate  on  this  collection.  This 
system,  in  the  form  in  which  it  was  tested,  responded  to  full  text 
English  queries  by  ranking  the  10,000  stored  messages  according  to  the 
relevance  of  each  to  the  submitted  request. 

In  preparing  the  system  for  test  and  trial,  basic  data  about  the 
distribution  of  words  and  word  strings  were  gathered.  Some  of  the 
statistics  were  obtained  because  they  were  needed  for  decisions  we 
made  along  the  way;  other  statistics  were  gathered  largely  because  it 
was  natural  or  easy  to  obtain  them  as  a  byproduct  of  the  processing. 

The tapes already employed in a large operational coordinate retrieval
system were purchased for our experimental use. This collection was
chosen in part because it had the following desirable attributes:

a.  It was developed independently of Arthur D. Little, Inc., and
    reflects a real information retrieval system in current use.

b.  The collection size (c. 70,000 documents) is sufficiently
    large to reflect a "mature" retrieval system.

c.  It is a "pure" coordinate system in the sense that no
    hierarchical indexing strategy is used.

d.  Those responsible for the system have resisted attempts to
    use terms that are not single words.

e.  The vocabulary of term-usage is relatively open-ended, with
    many synonyms being admissible, and the rank-frequency
    characteristic of term usage tends to behave according to
    Zipf's law, like natural language. The vocabulary therefore
    is of the general kind which can be obtained using automatic
    indexing techniques.

f.  Abstracts of the majority of the documents indexed in the
    collection are available in machine-readable form.

Two magnetic tapes were obtained. The first of these is an index
tape which lists, for each of some 70,000 documents, the index terms
that were assigned to it. The second set of tapes consists of English
abstracts of the information contained in about 45,000* of the
documents. The vocabulary used to index the documents comprised
4,500-5,000 index terms.

The association programs for the IBM 7090 existing at the time
could not handle a collection larger than about 10,000 documents
indexed by about 1,000 terms. Accordingly, it was necessary to extract
a portion of the given data to serve as input data for our experiments.

Two  subcollections  were  derived  and  are  referred  to  throughout 
this  Volume.  They  are: 

GE 1-A Indexing Vocabulary - The collection of 70,000 G.E. documents
was manually indexed from a vocabulary of 4826 Uniterms. This
vocabulary was partitioned at ADL, and we identified a group of about
1560 primarily metallurgical Uniterms we wished to exclude from
consideration. By choosing essentially every third term from the
group of 3266 remaining terms, a 1087 term sample of the "interesting"
terms was obtained. This sample is GE 1-A.

GE 2-A Indexing Vocabulary - GE 2-A is the set of 999 vocabulary
items used for automatic indexing of the collection of GE abstracts.
Whereas the Uniterms which form the basis of GE 1-A were assigned
by human indexers on the basis of reading the whole document, the
GE 2-A terms are the 999 highest-frequency content words which
appear in the texts of 45,000 G.E. abstracts. Singular and plural
forms were coalesced.

This volume gathers together some of the more interesting statistical
observations, frequency data and side effects that were obtained
during our work with this 446,097 word corpus of technical text.
Section I describes the data, the processing, and some frequency
distribution studies. Section II is a paper which compares the
manually selected vocabulary with one selected by automatic
processing. Section III is a group of papers dealing with vocabulary
distribution studies, including token frequencies and entropy
calculations, and the fitting of the Herdan-Waring distribution and
Zipf curves for the vocabulary. In Section IV the papers deal with
studies of content bearing units in text. A partial listing of the
data is given, and the units which were discovered as content bearing
are contrasted with two-word index terms in the NASA Vocabulary.


*  All abstracts in the collection were provided except those which are
   under security classification or considered proprietary by the
   company which provided the data.


SECTION I


CHARACTERISTICS  AND  INDEXING  OF  GE  DATA  BASE 


Word  Units,  Frequency  Counts,  and  Machine  Indexing  of  GE  2  Data  Base* 


A.  INTRODUCTION 

The present note describes and documents the computer processing
which was performed, during the second half of 1964, as an adjunct to
research on techniques for automatic message retrieval. During this
period, as described in detail below, it was found desirable to
investigate procedures for automatically indexing short messages by
selecting words (or strings of words) from the text of the message to
serve as index terms. The "messages" available in sufficient numbers
on magnetic tape consisted of the abstracts of technical articles.
This note describes the processing of these abstracts for the purpose
of obtaining:

1.  Detailed frequency data about the recurrence patterns of
    word strings. (See TN CACL-10 and Volume III, this series.)

2.  An operational retrieval system based on automatically
    indexing these abstracts. (See TN CACL-11.)

The data base chosen consisted of a subset of about 10,000 abstracts
chosen from the G.E. collection (see TN CACL-12) of 45,000 abstracts
dealing with topics about design, construction and testing of
aerospace vehicles. This data will be referred to as the G.E. 2
data base. These abstracts of technical articles can be viewed as
messages, for they report in compact and precise form a piece of
factual information, namely the content of the document of which they
are an abstract.

Experiments on automatically recognizing strings of words as
conceptual units, based on knowledge only of the frequencies of the
strings and their substrings, had previously been tried on a smaller
text with promising results.** It was therefore desirable to ascertain
whether these techniques were of significant value when applied to a
data base sufficiently large to yield conclusive results. The
procedures for obtaining the desired frequency data for recurrent
word-strings of length up to 4 words are described in this note. The
investigations for which these data are used are discussed in
TN CACL-4, Volume III, this series.


*Issued  on  March  9,  1965  to  a  limited  distribution  by  Joyce  S.  Mehring 
as  Technical  Note  CACL-13. 

** This work is described in TN CACL-4, Volume III, this series.


The production of an operational retrieval system based on automatic
indexing was desired as part of research on techniques for evaluating
retrieval systems. The orientation of this research made it
desirable to prepare two operating systems in order that their
performance could subsequently be compared with each other and with
the performance of human beings who would also "conduct retrieval" on
the same collection. This line of research is discussed in TN CACL-11.

Both  a  completely  operational  associative  retrieval  system  and  the 
coordinate  system  which  appears  as  a  by-product  were  needed  for  this 
work.  This  section  also  describes  the  procedures  whereby  single  words 
were  used  to  index  the  collection  automatically  for  these  ends. 

B.  COMPUTER  PROCESSING  OF  G.E.  2  DATA 

The computer processing of the G.E. 2 data is described in three
sections. Section 1 describes the selection of a subset of the G.E.
abstracts to be called the G.E. 2 data base and the generation of all
four-word strings appearing in this subset. Section 2 describes the
procedures for generating distinct four-word, three-word, two-word
and one-word strings, and the frequencies of strings and ordered
substrings. The procedures used to select a set of single words to be
used as index terms and the procedure for assigning index terms to
documents are discussed in Section 3.

1.  Selecting G.E. 2 Abstracts and Generating Four-Word Strings

To sample the G.E. abstracts and generate four-word strings,
a 7090 computer program, CNTXT, was designed which operated on
tapes containing the G.E. abstracts and a tape containing
exception words and carried out the following procedures:

selected  from  a  G.  E.  subcollection  of  45,000  abstracts 
every  fourth  abstract  containing  more  than  six  lines  and  recorded 
the  10,289  selected  abstracts  on  magnetic  tape; 

for  each  selected  abstract,  produced  the  G.  E.  abstract 
number  and  a  sequential  number  on  magnetic  tape; 

for  each  selected  abstract,  produced  on  magnetic  tape 
four-word  strings  with  the  G.  E.  abstract  number  in  which  the 
string  appeared.  All  words  which  were  exception  words  were  marked 
with  a  leading  blank.  Approximately  446,000  four-word  strings 
were  produced. 

a.  Definition  of  Four-word  String 

A four-word string or four-word context consists of four
contiguous words within an abstract. Punctuation marks are
not considered to be words; hence two words are contiguous
even if separated by punctuation marks. For every word in
the abstract, except words in the first line, a four-word
string is produced. Dummy strings are produced for the last
three words in an abstract by using blanks as the terminal
words in the four-word string. The text of the abstract,
excluding the first line, is scanned character by character
and "words" are recognized. In general these "words"
correspond to words as usually recognized; however, a precise
definition of "word" in this exercise is arrived at by
applying the definitions and rules given below.
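
The definition above can be illustrated with a minimal Python sketch. It is not the CNTXT program; the function name `four_word_strings` and the use of an empty string to stand for a blank terminal word are assumptions made only for illustration.

```python
from typing import List, Tuple

BLANK = ""  # assumption: an empty string stands for a blank terminal word


def four_word_strings(words: List[str], width: int = 4) -> List[Tuple[str, ...]]:
    """Return one width-word string per word of an abstract.

    The strings that start at the last (width - 1) words are "dummy"
    strings whose terminal positions are filled with blanks.
    """
    padded = words + [BLANK] * (width - 1)
    return [tuple(padded[i:i + width]) for i in range(len(words))]


contexts = four_word_strings(["DESIGN", "OF", "AEROSPACE", "VEHICLES", "TESTED"])
for context in contexts:
    print(context)
```

Because every word starts exactly one string, the number of four-word strings equals the number of words scanned, which is consistent with the roughly 446,000 strings reported for the G.E. 2 data.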

(1)  Method of Word Recognition

Definitions

S - Special characters:  -  .  )  ,  blank  #  /  (
A - All other characters appearing in the text
E - Special signals not appearing in the text


Classification of Characters for the Word Recognition Procedure


Following is a list of characters and their classifications:

Type 1    -
Type 2    .
Type 3    )  ,
Type 4    blank  #
Type 5    /  (
Type 6    All characters in A
Type 7    join1
Type 8    join2

"Reduction"  Rules 

Certain  pairs  of  characters  from  S  and  E  when 
appearing  as  contiguous  characters  in  the  text  are 
classified  as  a  single  character  from  S  or  E. 

The "reduction" rules used in these classifications
are stated below.

Type 1, Type 4  ->  Type 7
Type 7, Type 4  ->  Type 7
Type 4, Type 1  ->  Type 8
Type 2, Type 4  ->  Type 4
Type 3, Type 4  ->  Type 4
Type 4, Type 4  ->  Type 4
Type 5, Type 4  ->  Type 4
Type 8, Type 4  ->  Type 4


Special Character Configurations for the Word Recognition Procedure


Break configuration: a Type 6 character preceded
by a Type 1, 3, 4, or 5 character.


Concatenation configuration: a Type 6 character
preceded by a Type 7 or 8 character.


Decimal configuration: a Type 6 character preceded
by a Type 2 character.

(2)  Word  Recognition  Procedure 


Each character in the text, excluding the first line
of each abstract, is examined in a left to right scan. As
each character is scanned the classification of the
preceding character is available. If the character under
examination is a character from the set S, a reduction
rule is applied if appropriate and the resultant
classification is retained; otherwise the classification of
the special character is retained. If a character in A is
examined, that is, a Type 6 character, it is appended to
the current character string and its classification is
retained unless


(a)  A break configuration has been encountered, in
which case the present character string is said to
be a word and a new character string is started by
the Type 6 character in the configuration.

(b)  A concatenation configuration is encountered, in
which case the Type 6 character in the configuration
is appended to the present character string.

(c)  A decimal configuration is encountered, in which
case the decimal point and Type 6 character are
appended to the character string.

In all cases the scan is continued.

After each word is recognized it is examined to see
if it is one of 240 exception words. If the word is an
exception word, it is marked by a leading blank. Each
word produced consists of a maximum of 12 characters.
If a word recognized in the text consists of more than
12 characters, only the first 12 characters are used in
producing a word to be used in a word string.
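
The scan described above can be sketched in Python as a small state machine. This is an illustrative reading of the rules, not the CNTXT code. Several points are assumptions: Type 1 is taken to be the hyphen and Type 2 the period (decimal point), the start of the text is treated as if preceded by a blank, a newline is treated as a blank, and the names `SPECIAL_TYPE`, `REDUCE`, `BREAK_TYPES` and `recognize_words` are hypothetical.

```python
# Character types for the special characters (set S); everything else is Type 6.
# Assumptions: "-" is Type 1, "." is Type 2, and a newline behaves like a blank.
SPECIAL_TYPE = {"-": 1, ".": 2, ")": 3, ",": 3, " ": 4, "#": 4, "\n": 4, "/": 5, "(": 5}

# Reduction rules: (previous type, current type) -> resulting type.
# Types 7 and 8 are the "join" signals that never appear in the text itself.
REDUCE = {(1, 4): 7, (7, 4): 7, (4, 1): 8, (2, 4): 4,
          (3, 4): 4, (4, 4): 4, (5, 4): 4, (8, 4): 4}

BREAK_TYPES = {1, 3, 4, 5}  # a Type 6 character after one of these starts a new word


def recognize_words(text, exceptions=frozenset(), max_len=12):
    """Split text into words; exception words get a leading blank."""
    words, current, prev = [], "", 4  # assumption: text starts as if after a blank
    for ch in text:
        kind = SPECIAL_TYPE.get(ch, 6)
        if kind != 6:
            # Special character: apply a reduction rule if one matches.
            prev = REDUCE.get((prev, kind), kind)
            continue
        if prev in BREAK_TYPES and current:
            words.append(current)      # break configuration
            current = ch
        elif prev == 2:
            current += "." + ch        # decimal configuration
        else:
            current += ch              # concatenation, or ordinary continuation
        prev = 6
    if current:
        words.append(current)
    truncated = [w[:max_len] for w in words]
    return [" " + w if w.upper() in exceptions else w for w in truncated]


sample = "THE TWO-WORD STRINGS IN- VESTIGATED 3.5 G.E. DATA, (TAPES)"
tokens = recognize_words(sample, exceptions={"THE", "TWO"})
print(tokens)
```

Under these assumptions a hyphen followed by a blank rejoins a word split across lines, a hyphen between letters separates two words, and a period between Type 6 characters is kept inside the word.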

2.  Generating  Contexts  with  Frequency  Information 

The next objective was to produce four-word, three-word, two-word
and one-word contexts, each with information about the frequency
of occurrence of the contexts and substrings contained in the
contexts. Let A, B, C, D represent four words and let

    f_A   represent the number of times the word A occurs in the text,

    f_AB  represent the number of times the contiguous pair AB
          occurs in the text, etc.

Then the content of the four desired lists can be summarized in
the following way:

The one-word context list contains for each distinct word:

    A  f_A

The two-word context list contains for each distinct contiguous
pair:

    A B  f_A f_B f_AB

The three-word context list contains for each distinct contiguous
triplet:

    A B C  f_A f_B f_C f_AB f_BC f_ABC

The four-word context list contains for each distinct contiguous
quadruplet:

    A B C D  f_A f_B f_C f_D f_AB f_BC f_CD f_ABC f_BCD f_ABCD
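
As an illustration of what each context entry holds, the following Python sketch counts all contiguous strings of up to four words in memory and then assembles the frequencies in the order shown above. It is only a sketch under simplifying assumptions: it ignores abstract boundaries, dummy strings and minimum frequencies, and the names `ngram_counts` and `context_record` are hypothetical. The original processing worked from sorted tapes, as described below.

```python
from collections import Counter


def ngram_counts(words, max_n=4):
    """Count every contiguous string of 1..max_n words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def context_record(gram, counts):
    """Return the frequencies of all contiguous substrings of gram.

    The order is shortest substrings first, left to right, ending with
    the frequency of gram itself (for ABC: f_A f_B f_C f_AB f_BC f_ABC).
    """
    k = len(gram)
    return [counts[gram[i:i + n]] for n in range(1, k + 1) for i in range(k - n + 1)]


counts = ngram_counts("a b a b c".split())
record = context_record(("a", "b", "c"), counts)
print(record)
```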


a.  Production  of  Distinct  Strings 

To obtain the context lists described above, the first
major step was to obtain distinct four-word, three-word,
two-word and one-word strings with their frequency of occurrence.
A 7090 computer program, SQUISH, was designed to operate on
alphabetically ordered four-word strings to produce the four
sets of distinct strings and frequencies. In preparation for
SQUISH, the tape of four-word strings produced by CNTXT was
sorted into alphabetical order on the strings by the SORT
program in the IBM Basic Monitor System IBSYS. This sorted
tape was then used by SQUISH to produce four lists on magnetic
tape. The content of the four lists (four tapes) is described
below:

Distinct one-word strings and frequency of occurrence:

    A  f_A

Distinct two-word strings and frequency of occurrence:

    A B  f_AB

Distinct three-word strings and frequency of occurrence:

    A B C  f_ABC

Distinct four-word strings and frequency of occurrence:

    A B C D  f_ABCD


To be placed on a list a string had to occur a minimum
number of times. These minimum frequencies were given as
input data to SQUISH. The minimum frequencies used and the
resulting number of strings produced are summarized below.
The number of strings is approximate.

    23,600 distinct one-word strings with f_A >= 1
    48,000 distinct two-word strings with f_AB >= 2
    11,700 distinct three-word strings with f_ABC >= 3
     3,350 distinct four-word strings with f_ABCD >= 3

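
The idea behind this step can be sketched in Python: because every word starts one four-word string, the leading K words of the sorted four-word strings form contiguous groups whose sizes are the frequencies of the distinct K-word strings. The sketch below is not the SQUISH program. The function name `squish`, the in-memory lists, and the decision to drop prefixes that contain blank padding are assumptions.

```python
from itertools import groupby

# Minimum frequencies quoted in the text for 1-, 2-, 3- and 4-word strings.
MIN_FREQ = {1: 1, 2: 2, 3: 3, 4: 3}


def squish(sorted_strings, min_freq=MIN_FREQ):
    """Collapse sorted four-word strings into distinct K-word strings.

    sorted_strings must be a list of 4-tuples in sorted order, so that
    strings sharing a K-word prefix are adjacent.
    """
    lists = {}
    for k, threshold in min_freq.items():
        rows = []
        for prefix, group in groupby(sorted_strings, key=lambda s: s[:k]):
            n = sum(1 for _ in group)
            # Assumption: prefixes padded with blanks are not real strings.
            if n >= threshold and "" not in prefix:
                rows.append((prefix, n))
        lists[k] = rows
    return lists


# Four-word strings for the toy text "a b a b a b", with blank padding.
demo = sorted([
    ("a", "b", "a", "b"), ("b", "a", "b", "a"), ("a", "b", "a", "b"),
    ("b", "a", "b", ""), ("a", "b", "", ""), ("b", "", "", ""),
])
lists = squish(demo)
print(lists[1], lists[2])
```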
b.  Production  of  Final  Context  Tapes 


The list of one-word strings produced by SQUISH is in the
form desired for the one-word context list and hence the tape
containing that list is the one-word context tape. The three
remaining context lists could be produced from the information
available on the tapes produced by SQUISH. The procedure
for producing these desired lists was to obtain the frequency
of substrings of a K-word string from available frequency
information about (K-1)-word strings. The frequencies of all
substrings of a string were available since in this case the
minimum frequency of a K-word string is greater than or equal
to the minimum frequency of a (K-1)-word string, and a substring
occurs at least as often as any string that contains it. As
pointed out in the previous section,

    min f_ABCD = 3  >=  min f_ABC = 3  >  min f_AB = 2  >  min f_A = 1.

(1)  First Merge Phase

Two 7090/94 computer programs were designed to do
"merging". The first "merging" operations were carried
out by a 7090 computer program MERGE I which produced
from the four tapes generated by SQUISH three tapes
designated here by T2', T3' and T4' whose content can be
summarized in the following way:

    T2'  Two-word string tape     A B      f_A f_AB

    T3'  Three-word string tape   A B C    f_A f_AB f_ABC

    T4'  Four-word string tape    A B C D  f_A f_AB f_ABC f_ABCD

(2)  Second  Merge  Phase 

The second set of merging operations was carried out
by a 7090 program MERGEG. This program produced from a
specially ordered K-word string tape and a (K-1)-word final
context tape a K-word string tape with all the desired
frequency information. This newly generated tape, when
sorted into alphabetical order on entire word strings,
was the final K-word context tape.


As a preliminary step of data preparation for MERGEG
the tapes produced by MERGE I were sorted in the following
way:


T2' was sorted into alphabetical order on the second
word.

T3' was sorted into alphabetical order on the two
terminal words.

T4' was sorted into alphabetical order on the three
terminal words.


The IBSYS SORT program was used. The resultant tapes
will be designated here as ST2', ST3' and ST4'.


(3)  Final  Two-Word,  Three-Word  and  Four-Word  Context  Tapes 


MERGEG and the IBSYS SORT programs were then used
alternately to produce the desired lists. The final
one-word context tape and ST2' were input to MERGEG which
produced a tape containing for every entry:

    A B  f_A f_B f_AB

This output tape, ordered like ST2', when sorted into
alphabetical order on the entire word string was the final
two-word context tape.* MERGEG then operated on the
final two-word context tape and ST3' to produce a tape
which contained for every entry:

    A B C  f_A f_B f_C f_AB f_BC f_ABC

This tape, ordered like ST3', when sorted into
alphabetical order on the entire word string was the final
three-word context tape.* The final three-word context tape
and ST4' were used by MERGEG to produce a tape which
contained for every entry:

    A B C D  f_A f_B f_C f_D f_AB f_BC f_CD f_ABC f_BCD f_ABCD

This tape, ordered like ST4', when sorted into
alphabetical order on the entire word string was the final
four-word context tape.
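
The second merge phase amounts to a sort-merge join: a K-word list ordered on its terminal K-1 words is walked in step with the alphabetically ordered (K-1)-word context list, so each entry picks up the frequencies of its terminal substring in a single pass. The Python sketch below illustrates this for K = 2. It is not the MERGEG program; the tuple layouts and the names `mergeg`, `one_word`, `t2` and `st2` are assumptions, and in-memory lists stand in for the tapes.

```python
def mergeg(k_rows, context_rows):
    """Attach terminal-substring frequencies to K-word rows in one pass.

    k_rows:       list of (gram, prefix_freqs), sorted on gram[1:]
    context_rows: list of (gram, freqs) for (K-1)-word strings, sorted on gram
    Returns rows (gram, prefix_freqs, terminal_freqs) sorted on the whole gram.
    """
    out = []
    j = 0
    for gram, prefix_freqs in k_rows:
        terminal = gram[1:]
        # Advance the (K-1)-word list until it reaches the terminal substring.
        while j < len(context_rows) and context_rows[j][0] < terminal:
            j += 1
        if j < len(context_rows) and context_rows[j][0] == terminal:
            out.append((gram, prefix_freqs, context_rows[j][1]))
    return sorted(out)  # final sort on the entire word string


# One-word context list: (word, f_A). T2' rows: (A B, (f_A, f_AB)).
one_word = [(("a",), 3), (("b",), 3)]
t2 = [(("a", "b"), (3, 3)), (("b", "a"), (3, 2))]

st2 = sorted(t2, key=lambda row: row[0][1:])   # ST2': ordered on the second word
two_word = mergeg(st2, one_word)               # adds f_B to every entry
print(two_word)
```

The same function applies unchanged when the rows are three-word or four-word strings ordered on their two or three terminal words.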


3.  Indexing  Documents 

To produce a collection of abstracts automatically indexed by
single words selected from text, frequency information was used
for selecting index terms and a computer program was designed to
assign the resulting index terms to the abstracts in which they
appeared.

a.  Selection  of  Index  Terms 


The  index  terms  were  selected  from  among  the  one-word 
contexts  by  the  following  criteria: 

The  singular  and  plural  forms  of  words  occurred  jointly 
a  total  of  56  or  more  times  in  the  text; 

The  first  character  of  the  word  was  a  letter  A  through  Z. 

This  latter  constraint  eliminated  exception  words  and 
numbers  from  the  set  of  index  terms. 

Using these criteria, 1434 distinct words were selected,
counting both singular and plural forms. After coalescing
singular and plural forms, that is, assigning the same
representative form to both the singular and plural form of a
word, there were 999 representative index terms. If a word
occurring in the list of 1434 terms was found in an abstract, the
representative term in the list of 999 terms was assigned as
an index term. The set of index terms each with its
representative was recorded on a dictionary tape.
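
The selection criteria can be illustrated with a short Python sketch. The text does not say how singular and plural forms were matched, so the rule used here (a word ending in S is the plural of the same word without the S, if that word also occurs) is purely an assumption, as are the function name `select_index_terms` and the toy frequencies.

```python
def select_index_terms(freq, min_total=56):
    """Map each selected word to its representative index term.

    freq: dict of word -> frequency, with exception words carrying a
    leading blank as described in the text.
    """
    representative = {}
    totals = {}
    for word, count in freq.items():
        # Criterion: the first character must be a letter A through Z.
        if not ("A" <= word[:1] <= "Z"):
            continue
        # Assumption: naive plural rule, used only for illustration.
        singular = word[:-1]
        rep = singular if word.endswith("S") and singular in freq else word
        representative[word] = rep
        totals[rep] = totals.get(rep, 0) + count
    # Criterion: singular and plural forms jointly occur min_total times or more.
    return {w: rep for w, rep in representative.items() if totals[rep] >= min_total}


toy_freq = {"ROCKET": 40, "ROCKETS": 20, "ORBIT": 30,
            " THE": 8911, "3.5": 100, "TESTS": 60}
dictionary = select_index_terms(toy_freq)
print(dictionary)
```

In this toy example three words are selected and they share two representative terms, which mirrors the relation between the 1434 selected words and the 999 representative index terms.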

b.  Indexing  Procedure 

A 7090 computer program INDEX was designed which assigned
an index term to an abstract if the term appeared in the
abstract. From the four-word string tape produced by CNTXT
and a tape of dictionary terms selected according to the
above criteria, the program produced a tape containing abstract
number-term number pairs. This tape contained about 222,000
pairs, that is, about 220,000 words in the text were index
terms. This tape was then sorted into abstract number-term
number order and duplicate pairs were eliminated by a 7090/94
computer program ELIM. Hence if an index term appeared more
than once in an abstract, only one pair entry was retained.

*  The number of entries on these tapes is the same as the number on the
   corresponding 1, 2, 3, and 4-word tapes produced as the output of SQUISH.

This pair tape was then used by an existing program to
produce a Packed Document Term Matrix. This Packed Document
Term Matrix was the input to a set of existing programs which
produced a document-term tape and a word-association matrix
for use in retrieval. Among the first steps in the
production of the association matrix, index-word pairs were
generated based on co-occurrence of index terms within an abstract.
About 2,933,000 index-word pairs were generated for this data
base.
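
The indexing and pair-generation steps can be sketched in Python as follows. This is not the INDEX or ELIM code, nor the existing matrix programs. The function names, the in-memory data layout, and the choice to count each unordered pair of distinct terms once per abstract are assumptions.

```python
from collections import Counter
from itertools import combinations, groupby


def index_pairs(word_stream, dictionary):
    """Build sorted, duplicate-free (abstract number, term number) pairs.

    word_stream: iterable of (abstract number, word)
    dictionary:  word -> term number of its representative index term
    """
    pairs = [(doc, dictionary[w]) for doc, w in word_stream if w in dictionary]
    return sorted(set(pairs))  # sorting plus duplicate elimination


def cooccurrence_counts(pairs):
    """Count pairs of index terms that occur together in an abstract."""
    counts = Counter()
    for _, group in groupby(pairs, key=lambda p: p[0]):
        terms = [term for _, term in group]
        counts.update(combinations(terms, 2))  # assumption: unordered pairs
    return counts


stream = [(1, "ROCKET"), (1, "ROCKETS"), (1, "ORBIT"), (2, "ORBIT"), (2, " THE")]
term_numbers = {"ROCKET": 0, "ROCKETS": 0, "ORBIT": 1}

pairs = index_pairs(stream, term_numbers)
cooc = cooccurrence_counts(pairs)
print(pairs, dict(cooc))
```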


FUNCTION WORDS*

The following words comprised the exception list of function words in the
processing which was described in Technical Note CACL-13. We are indebted
to H. Rubenstein for this list.


A.  Alphabetic  Listing  of  Function  Words 


ABOUT           BE              HAD
ABOVE           BETWEEN         HARDLY
ACROSS          BEYOND          HAS
AFTER           BOTH            HAVE
AGAINST         BUT             HAVING
ALL             BY              HENCE
ALMOST          CANNOT          HEREIN
ALONE           CAN             HERE
ALONG           COULD           HER
ALSO            DID             HERSELF
ALTHOUGH        DOES            HE
ALWAYS          DOING           HIM
AMONG           DONE            HIMSELF
AM              DO              HIS
AND             DOWN            HITHER
ANOTHER         DURING          HOWBEIT
AN              EACH            HOWEVER
ANYBODY         EITHER          HOW
ANYONE          ELSE            IF
ANY             ELSEWHERE       INASMUCH
ANYTHING        ENOUGH          INDEED
ANYWHERE        ETC             INNER
APART           EVEN            IN
ARE             EVER            INSOFAR
AROUND          EVERYONE        INSTEAD
A               EVERY           INTO
ASIDE           EVERYTHING      INWARD
AS              EVERYWHERE      I
AT              EXCEPT          IS
AWAY            FEW             IT
AWFULLY         FOR             ITSELF
BECAUSE         FORTH           ITS
BEEN            FROM            JUST
BEFORE          FURTHERMORE     KEEP
BEHIND          GET             KEPT
BEING           GETS            LEAST
BELOW           GOT             LESS

LEST            PLUS            TO
MANY            QUITE           TOWARD
MAY             RATHER          TWO
ME              REALLY          UNDERNEATH
MIGHT           RIGHT           UNDER
MINE            SELF            UNLESS
MOREOVER        SELVES          UNTIL
MORE            SEVERAL         UNTO
MOST            SHALL           UPON
MUCH            SHE             UP
MUST            SHOULD          UPWARD
MY              SINCE           US
MYSELF          SIX             VERY
NEITHER         SOMEBODY        WAS
NEVERTHELESS    SOME            WELL
NEXT            SOMETHING       WERE
NOBODY          SOMETIMES       WE
NONE            SOMEWHAT        WHATEVER
NOR             SO              WHAT
NO              STILL           WHENCE
NOTHING         SUCH            WHENEVER
NOT             TEN             WHEN
NOWHERE         THAN            WHERE
OF              THAT            WHEREVER
OH              THEIR           WHETHER
ONE             THEIRS          WHICH
ONES            THEM            WHILE
ONLY            THEMSELVES      WHOM
ON              THENCE          WHO
ONTO            THEN            WHOSE
OR              THEREBY         WHY
OTHER           THEREFORE       WILL
OTHERS          THERE           WITHIN
OTHERWISE       THE             WITHOUT
OUGHT           THESE           WITH
OUR             THEY            WOULD
OURSELVES       THIS            YES
OURS            THOSE           YET
OUTSIDE         THOUGH          YOUR
OVER            THROUGHOUT      YOURSELF
OWN             THUS            YOURSELVES
PER             TOGETHER        YOURS
PLEASE          TOO             YOU

*  Issued on April 15, 1965 to a limited distribution by Paul R. Jones
   as a Supplement to Technical Note CACL-13.


B.  Frequency of Function Words in Corpus G.E.-2

A partial list of the frequencies of occurrence in this text of
frequent function words follows.

Word        Frequency       Word        Frequency

OF            30170         THIS           374
AND           16622         NO             372
THE            8911         UP             356
IN             8879         WERE           355
TO             8045         THAN           353
FOR            7406         ITS            347
A              6232         OTHER          331
ON             4634         WHEN           303
WITH           3844         HAS            292
AT             2770         DURING         286
BY             2759         HAVE           282
FROM           1807         BOTH           272
IS             1704         I              269
AS             1689         THESE          268
AN             1474         ALL            260
WHICH          1180         IT             260
BE             1077         SEVERAL        260
ARE             996         MAY            255
THAT            954         BEEN           250
OR              906         ALSO           241
TWO             899         ONLY           241
BETWEEN         676         THEIR          238
UNDER           596         MORE           225
NOT             558         PER            225
SOME            545         SUCH           222
CAN             479         BUT            212
OVER            424         ABOUT          204
INTO            410         HAVING         203
WAS             398         WITHOUT        184
ONE             381         UPON           175
(Count  ■  60) 
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The  function  words  occurring  between  12  and  1  times  can  be  listed  in 
frequency  order. 


ALWAYS       12          NOTHING        4
WE           12          ONTO           4
GET          11          SOMETIMES      4
QUITE        11          YOU            4
FORTH         9          APART          3
INSTEAD       9          HENCE          3
JUST          9          NEITHER        3
NOR           9          UNLESS         3
EVERY         8          YOU            3
HEREIN        8          DOING          2
WHY           8          ELSEWHERE      2
YET           8          EVERYWHERE     2
AWAY          7          MINE           2
OUR           7          OH             2
THEREBY       7          THEMSELVES     2
WHO           7          UPWARD         2
NONE          6          WHATEVER       2
ONES          6          ANYTHING       1
SOMEWHAT      6          ASIDE          1
THOUGH        6          GETS           1
EVER          5          HER            1
INWARD        5          HIM            1
KEEP          5          INSOFAR        1
KEPT          5          LEST           1
OTHERWISE     5          SOMETHING      1
THEREFORE     5          WHEREVER       1
ME            4

(Count = 53)
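A tabulation of this kind can be sketched in a few lines of modern code. The example below is only an illustration, not the program used for Corpus G.E.-2: the function-word set is a tiny hypothetical subset of the full list, and the whitespace tokenization is an assumption.

```python
from collections import Counter

# Hypothetical miniature function-word list; the report's list is far longer.
FUNCTION_WORDS = {"OF", "AND", "THE", "IN", "TO", "FOR", "A"}

def function_word_frequencies(text, function_words=FUNCTION_WORDS):
    """Count occurrences of each function word in running text."""
    tokens = text.upper().split()  # assumption: tokens are whitespace-delimited
    counts = Counter(t for t in tokens if t in function_words)
    # most_common() orders the words by descending frequency.
    return counts.most_common()

sample = "the strength of the alloy and the effect of heat"
freqs = function_word_frequencies(sample)
```

Here `freqs` lists each function word found in `sample` with its count, most frequent first.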
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ESTIMATING  RECURRENCE  OF  MISSPELLINGS 

IN  CORPUS  G.E.-2  * 


A.  Anomalies  among  Words  with  Frequency  2 

The  following  "words"  were  considered  anomalous  (by  me)  among  the  types 
which  occurred  twice  in  Corpus  G.E.-2. 


ABL
AB
ACIER
ADVANM          ADVANCE ?
AGCL
AGIG
ALCHOL          ALCOHOL ?
ALCUMG
ALLM            ALUM ?
ALUN            ALUM ?
ALYER           LAYER ?
ANALYSI         ANALYSIS ? or
ANALYT
ANAM
ANF
ANOCUT
APPROXM         APPROX. ?
APRALLEL        PARALLEL ?
APUS
ARO
ARTIFICAL       ARTIFICIAL ?
ASCAST
AT320           AT 320 ?
AT423F          AT 423 F ?
ATA             AT A ?
ATED            -ATED ?
AVON
AVOVE           ABOVE ?
AWS
BARIA
BEHAVIOUS       BEHAVIOUR ?
BOLTZMAN        BOLTZMANN ?

* Issued on April 16, 1965 to a limited distribution by Paul E. Jones
as a Supplement to Technical Note CACL-13
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BOURDON 

BOVERI 

BRITTEL 

CALCU  LATI 0N0 

CALI B RAM 

CAPACITIVELY 

CAPACITIVE 

CARACTERI STIC 

CASCADEM 

CAVIATATION 

CECOSTAMP 

CENM 

CENT RATION 

CHARACTERM 

CHARACTERTIC 

CIRM 

C.D 

C.W 

CLOSEDM 

CLUDING 

COATINS 

COEFFICEINTS 

COMBUTION 

COMME 

COMMINUTED 

COMPLEETE 

COMPLES 

COMPRESSEUR 

COMPRESSORE 

COMPTOIR 

COMPUM 

CONCN 

CONDI 

CONDITIO 

CONDI TI 

CONDITONS 

COND 


BRITTLE  ? 
CALCULATION  OF  ? 

?

?

CHARACTERISTIC  ? 
CASCADED  ? 
CAVITATION  ? 

?

CENT  ? 

CHARACTER  ? 
CHARACTERISTIC  ? 

C.D.  ? 

C.W.  ? 

CLOSED  ? 
INCLUDING  ? 
COATINGS  ? 
COEFFICIENTS  ? 
COMBUSTION  ? 


French  ? 
Italian  ? 


CONDITION  ? 
"

"

CONDITIONS  ? 
CONDITION  ? 


Summary,  450  words  checked,  71  probable  misspellings.  (16  per  100) 


B.  Anomalies  Among  Words  with  Frequency  3 

AEDC 

ATI 0 NS  -ATIONS  ? 

ATURE  -ATURE  ? 

AUBES 
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B 

BRER 

BULENT 

BURING 

BUTION 

CERTAINS 

CHAUD 

CLUDED 

OOEFFICIEN 

COMMERICAL 

COMPLICANCE 

COMPUTOR 

COMUSTION 

CONDI  M 

CONSIDM 

CONSTIM 

DEFORMAM 

DESM 

DETERMING 


300  checked,  22  probable  errors  (7.3  per  100) 


C.  Anomalies  Among  Words  Occurring  5  Times 

APPENDIXES 

ARY 

BLASIUS 

BUNA 

DIFFERM 

ENM 

ENTRE 

FORMANCE 

ISTICS 

LATED 


300  tested,  9  anomalies  (3  per  100) 

D.  Anomalies  Among  Words  of  Frequency  8 


ATI  ON 

-ATION  ? 

CD 

FIZ 

ME  NTS 

-MENTS  ? 

MISSLE 

MISSILE  ? 

PREM 

SNECMA 
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THERMDM 

TRANSM 


300  tested,  9  anomalies  (3  per  100) 


E.  Anomalies  Among  Words  of  Frequency  10 

LETUDE 

NIM0NIC 

ONERA 

PERM 


~300  tested  frequency  10  plus  above,  4  anomalies 


F.  Estimating  the  Error  Rate  in  Corpus  G.E.-2 

It  would  be  useful  to  have  an  estimate  of  the  number  of  misspelled  words 
per  100  words  of  running  text.  An  estimate  of  this  can  be  obtained  very 
crudely  using  the  data  so  far  acquired  if  they  are  plotted  as  in  the 
attached  figure. 

The  plot  shows  the  error  rates  in  each  of  the  frequency  classes  examined, 
and  the  straight  line  shows  an  approximate  lower  bound  on  these  error 
rates.  We  can  calculate  the  number  of  misspelled  tokens  in  each  of  the 
frequency  classes  by 

letting   $T_f$ = no. of types with frequency $f$

          $f$ = no. of text instances (tokens) of a type with
                frequency $f$

          $\epsilon_f$ = error rate of types of frequency $f$

If we multiply $T_f$ by $f$ we get the number of tokens for words of
frequency $f$. The product

          $f \, \epsilon_f \, T_f$   gives the desired number,

the number of misspelled tokens contributed by the class of types with
frequency $f$.

By summing from $f = 1$ to 10 we can count an estimated lower bound on
the total number of misspelled tokens (for we can assume that the number
of errors repeated with frequency greater than 10 is negligible).
Since there were approximately 446,000 tokens in the text, we can estimate
the error rate $r$, in misspellings per 100 tokens, as

$$ r = 100 \times \frac{\sum_{f=1}^{10} f \, \epsilon_f \, T_f}{446{,}000} $$

  $f$     $\epsilon_f$     $T_f$      $f \, \epsilon_f \, T_f$

   1        .42           12,485         5250
   2        .16            2,929          960
   3        .07            1,370          288
   4        .04              824          132
   5        .03              631          100
   6        .02              422           50
   7        .02              346           50
   8        .02              310           50
   9        .02              268           50
  10        .01              221           22
                                         ----
                                         6942

The estimate of the error rate is thus

$$ r \approx 100 \times \frac{7{,}000}{446{,}000} \approx 1.5 $$

errors per 100 words of running text (approx.).


As  a  rule  of  thumb,  we  probably  have  1/2  misspelled  word  per  abstract 
on  the  average. 
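The estimate can be checked with a short calculation. The sketch below takes the error rates and type counts from the table above and forms the products without intermediate rounding, so the individual terms and the total differ slightly from the rounded tabulated entries; the variable names are illustrative only.

```python
# Error rate per frequency class (fraction of types judged misspelled).
ERROR_RATE = {1: .42, 2: .16, 3: .07, 4: .04, 5: .03,
              6: .02, 7: .02, 8: .02, 9: .02, 10: .01}

# Number of types T_f occurring with frequency f.
TYPE_COUNT = {1: 12485, 2: 2929, 3: 1370, 4: 824, 5: 631,
              6: 422, 7: 346, 8: 310, 9: 268, 10: 221}

TOTAL_TOKENS = 446_000  # approximate size of the running text

def misspelled_tokens(f):
    """Estimated misspelled tokens contributed by types of frequency f."""
    return f * ERROR_RATE[f] * TYPE_COUNT[f]

total_misspelled = sum(misspelled_tokens(f) for f in ERROR_RATE)
rate_per_100 = 100 * total_misspelled / TOTAL_TOKENS
```

Evaluated this way, the total comes to a little over 6,900 misspelled tokens, or roughly 1.5 per 100 tokens, in line with the rounded estimate in the text.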
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DISTRIBUTION  OF  TERM-SET  SIZE 
FOR  THE  G.E.-2  AUTO- INDEXED  COLLECTION  * 


A. Introduction


As described in detail in Technical Note CACL-13, the 10,287 abstracts
in Corpus G.E.-2 were automatically indexed by

1. Forming a keyword vocabulary consisting of 1434 word
forms corresponding to the 999 most frequent
(non-function-word) types in the text of the collection
(about 446,000 running words).

2. Indexing an abstract by the set of types appearing
in it. Once a type has been assigned to an abstract,
the type can be called a term.

The result of this indexing is the production of a matrix C, a binary
(10,287 x 999) matrix, each row of which has nonzero entries corresponding
to the terms assigned to one of the abstracts. The total number of
terms assigned to an abstract by the automatic indexing process is
obtained, of course, by counting the number of nonzero entries in the
row, and this number is defined to be the term-set size for the given
abstract.
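As a minimal sketch of this definition, the term-set size of each abstract is the count of nonzero entries in its row of C. The small matrix below is invented for illustration; the real C matrix is 10,287 x 999.

```python
def term_set_sizes(C):
    """Return the number of nonzero entries in each row of a binary matrix."""
    return [sum(1 for entry in row if entry != 0) for row in C]

# Toy document-term matrix: 3 abstracts (rows) by 4 terms (columns).
C = [
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
sizes = term_set_sizes(C)
```

Each element of `sizes` is the term-set size of the corresponding abstract.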

The  distribution  of  the  term-set  size  over  the  collection  at  hand  is  an 
important  parameter  both  for  describing  the  indexing  of  the  collection 
and  for  computing  (and  interpreting)  the  term  association  measures  that 
are  generated.  It  will  be  recalled  that  the  Linear  Associative  Model 
makes provision for a normalization for "document length" (i.e., term-set
size)  to  account  for  the  belief  that  the  cooccurrence  of  two  terms  in 
a  "long"  document  is  less  weighty  than  the  cooccurrence  of  two  terms  in 
a  "short"  one.  Strict  adherence  to  the  model  would  require  accounting 
for  variations  in  "document  length"  in  computing  the  associations. 


Issued  by  Paul  E. Jones  to  a  limited  distribution  on  May  4,  1965  as 
Technical  Note  CACL-16 

See  Volume  II. 




Section  I:  CACL-16 


Our  computer  programs  for  computing  associations  do  not,  strictly  speaking, 
embody  this  adjustment  for  variations  in  term-set  size.  We  weight  all 
cooccurrences  equally,  and  we  normalize,  in  practice,  by  the  cooccurrence 
total.  Use  of  the  computer  programs  thus  embodies  the  assumption  that 
all  term-sets  are  roughly  the  same  size.  In  previous  work  this  has  been 
assured by the procedures we have followed (e.g., in G.E.-1 by throwing
out  all  documents  with  fewer  than  7  terms).  But  in  the  G.E.-2  collec¬ 
tion  no  such  constraints  were  used;  an  unfavorable  distribution  of  term- 
set  size  could  thus  conceivably  have  emerged. 

In  manually-indexed  collections  we  have  reason  to  believe  that  there  are 
historical  trends  in  term-set  size.  B.  Dennis  of  G.E.  has  reported 
(private  communication  to  V.  E.  Giuliano)  that  early  in  the  formation  of 
the  G.E.  collection,  indexers  assigned  relatively  fewer  terms  to  documents 
than  they  did  later  on.  In  manual  indexing,  the  average  number  of  terms 
per  document  increased  with  time,  presumably  because  there  was  increas¬ 
ing  need  for  a  finer  description  of  document  contents  as  the  collection 
increased  in  size. 

Although  this  observation  applies  to  manually-assigned  terms,  one 
wonders  whether  a  corresponding  trend  might  not  be  apparent  in  the 
automatically-indexed  collection.  One  can  easily  speculate  that  when 
the  collection  was  first  formed,  several  thousand  documents  on  mis¬ 
cellaneous  subjects  were  on  hand  and  were  encompassed  in  the  growing 
collection  before  the  theme  of  the  collection  had  a  chance  to  develop. 
Later,  when  the  aerospace-metallurgy  theme  became  dominant  this  initial 
transient  could  be  expected  to  fade.  The  administrators  of  the  collec¬ 
tion  would  be  expected  to  be  more  clearly  selective  about  documents  that 
belonged  in  the  collection.  To  the  extent  that  a  high  proportion  of 
the  early  documents  deal  with  miscellaneous  subjects  (like  agriculture) 
in  which  aerospace  vocabulary  is  not  employed,  they  would  fail  to  contain 
the  set  of  keywords  that  comes  to  be  repeated  frequently  later.  We  would 
thus  expect  that  the  early  messages  might  tend  to  have  fewer  keywords  in 
their  abstracts  than  later  ones  do. 

Indeed it is possible to conjecture (pessimistically) that the distribution
of term-set size, over the collection as a whole, is bimodal. If
early documents are "short" while later ones are "long," this could
in  principle  lead  to  the  presence  in  the  collection  of  a  large  number  of 
very  short  (e.g.,  3  terms)  documents,  together  with  a  very  large  number 
of  "long"  (e.g.,  20  terms)  documents,  and  very  few  in  between.  Clearly 
the  idea  that  all  term-sets  are  more-or-less  the  same  size  would  not  be 
tenable  in  this  case. 


B.  The  Distribution  of  Term-Set  Size  as  a  Function  of  Accession  Number 

The  availability  of  the  C  matrix  for  Corpus  G.E. -2  permitted  an 
analysis  to  be  made  of  the  behavior  of  term-set  size  as  a  function  of 
accession number. For intervals of 200 documents, a count was made of
the number of documents with "length" i (i = 1, ..., 50)*. No documents
of length 0 were encountered, nor were any documents with more than 50
keywords  observed. 

Figure  1  shows  a  plot  of  the  distribution  of  message  "length"  over  the 
whole  collection.**  The  shape  of  the  distribution  is  outlined  by  the 
line  segments  connecting  the  dots.  By  inspection,  the  collection  as  a 
whole  shows  a  distribution  of  message  length  that  can  be  regarded  as 
close  enough  to  normal  to  dispel  worries  about  bimodality.  Nevertheless, 
there  seems  to  be  a  slight  tendency  towards  peaking  at  20  terms. 

Superimposed  on  this  information  are  plots  of  the  distributions  for  the 
first half of the collection taken alone (bar chart), and for the first
1/4  of  the  collection  (crosses).  A  slight  shift  to  "shorter"  messages 
among  early  accessions  is  seen. 

To  check  the  extent  of  shift  in  the  message  length  between  early  and 
recent  documents,  Figure  2  shows  the  distribution  for  the  first  1000 
documents  overlaid  on  that  for  the  last  1000  documents.  We  see  that 
the  early  segment  of  documents  tends  to  have  one  or  two  fewer  terms  per 
document  than  the  late  segment  does.  This  difference  is  not  considered 
sufficiently  large  to  be  worth  pursuing  further. 


C.  Conclusions 


a. For the automatically-indexed collection, the hypothesis
that the term-set size exhibits a clear trend to increasing
"length" as a function of accession number
is partially supported, but the trend is considered
negligible. (Note that since we have used a sampling
technique, our figures for 1000 G.E.-2 messages represent
7,500 of the original documents. While a
major trend is not apparent in our subset, I would
not be surprised to see a trend within the first
7,500 G.E. documents.)

b.  The  distribution  of  message  length  is  not  bimodal 
over  the  collection  as  a  whole. 

c.  The  average  number  of  terms  per  document  over  the 
collection  as  a  whole  is,  accurately: 


* This data is available on a computer printout. Because it is 50 pages
long, it is only summarized here for the collection as a whole.

** The tallies are shown in Table 1.
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Number  of  Terms  Assigned  =  Message  "Length" 


FIGURE 1


DISTRIBUTION OF AUTO-INDEX TERMS PER MESSAGE FOR THE GE-2 COLLECTION
AND ITS INITIAL FRAGMENTS


   L            N            LN             L²N
Message     No. Docs       No. DT       Full No. of
Length      This Long      Pairs         TT Pairs

   1            0              0              0
   2            2              4              8
   3            5             15             45
   4           12             48            192
   5           35            175            875
   6           58            348           2088
   7           83            581           4067
   8          131           1048           8384
   9          207           1863          16767
  10          349           3490          34900
  11          444           4884          53724
  12          574           6888          82656
  13          706           9178         119314
  14          758          10612         148568
  15          874          13110         196650
  16          905          14480         231680
  17          825          14025         238425
  18          807          14526         261468
  19          709          13471         255949
  20          717          14340         286800
  21          541          11361         238581
  22          412           9064         199408
  23          337           7751         178273
  24          230           5520         132480
  25          202           5050         126250
  26          116           3016          78416
  27           89           2403          64881
  28           61           1708          47824
  29           39           1131          32799
  30           17            510          15300
  31           13            403          12493
  32            7            224           7168
  33            8            264           8712
  34            3            102           3468
  35            2             70           2450
  36            2             72           2592
  37            3            111           4107
  38            1             38           1444
  39            0              0              0
  40            0              0              0
  41            0              0              0
  42            0              0              0
  43            2             86           3698
  44            0              0              0
  45            0              0              0
  46            1             46           2116

TOTALS      10287         172016        3105020


TABLE 1


TABULATION OF THE DISTRIBUTION OF MESSAGE LENGTH


172016 / 10287 = 16.72 avg. terms per document.


d. The total number of term pairs for the collection as
a whole is, accurately:

L²N - LN = 2,933,004


This corresponds to 285 pairs per document.
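The summary figures in conclusions c and d follow directly from the tallies of Table 1. The sketch below recomputes them from the document counts N for each message length L; the dictionary name is illustrative, and lengths with no documents are simply omitted.

```python
# N = number of documents having term-set size L, as tallied in Table 1.
DOCS_BY_LENGTH = {
    2: 2, 3: 5, 4: 12, 5: 35, 6: 58, 7: 83, 8: 131, 9: 207, 10: 349,
    11: 444, 12: 574, 13: 706, 14: 758, 15: 874, 16: 905, 17: 825,
    18: 807, 19: 709, 20: 717, 21: 541, 22: 412, 23: 337, 24: 230,
    25: 202, 26: 116, 27: 89, 28: 61, 29: 39, 30: 17, 31: 13, 32: 7,
    33: 8, 34: 3, 35: 2, 36: 2, 37: 3, 38: 1, 43: 2, 46: 1,
}

n_docs = sum(DOCS_BY_LENGTH.values())                        # sum of N
dt_pairs = sum(L * n for L, n in DOCS_BY_LENGTH.items())     # sum of L*N
tt_full = sum(L * L * n for L, n in DOCS_BY_LENGTH.items())  # sum of L^2*N

mean_terms = dt_pairs / n_docs      # average term-set size
term_pairs = tt_full - dt_pairs     # term pairs, a term paired with itself excluded
pairs_per_doc = term_pairs / n_docs
```

The difference `tt_full - dt_pairs` sums L(L - 1) over all documents, that is, the number of ordered pairs of distinct terms cooccurring in a document.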
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COMPARISON  OF  MANUAL  AND  MACHINE  SELECTED  VOCABULARIES 


Similarities  and  Differences  of  Machine  Selected  GE  2  Indexing 
Vocabulary  and  Manual  Uniterm  Vocabulary* 


The  purpose  of  this  Technical  Note  is  to  compare  certain  gross  features 
of  the  machine-selected  G.E. -2  associative  indexing  vocabulary  against 
those  of  the  UNITERM  vocabulary  used  to  index  the  parent  G.E.  collection. 


I.  BACKGROUND 
A.  Parent  G.E.  Collection 

The G.E.-2 collection consists of 10,289 abstracts indexed by 999
machine-selected "association" terms representing 1,434 singular and plural word
forms. The abstracts were selected by sampling from a larger collection
which we purchased in 1962 from General Electric Co. This parent collection
consists of some 70,000 document surrogates, each consisting of a
document number and a set of assigned descriptors. Actual abstracts
for only about 45,000 of the documents have been made available to us.
These documents were indexed manually by G.E. using UNITERM descriptors;
about 4,780 descriptors were used to index the parent collection, with
the average number of index terms per document being 12.5. We have
available a list of these UNITERMS and their use frequencies in the
parent collection; we also have available on magnetic tape all 70,000
surrogates, the 45,000 abstracts and various auxiliary data relating
to the parent G.E. collection.


B.  G.E. -2  Collection 


The 10,289-abstract G.E.-2 subcollection was selected by machine. The
basic procedure used was to select every fourth abstract out of the
45,000 at hand, but skipping those less than six lines long. Since the
parent collection was ordered according to accession date, the G.E.-2
subcollection represents a rather uniform sampling in time of the entire
parent collection.
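The sampling rule can be sketched as follows. This is one reading of the procedure, not the original program: it assumes that every fourth abstract is examined in accession order and then dropped if it is shorter than six lines. The function name and parameters are hypothetical.

```python
def sample_abstracts(abstracts, step=4, min_lines=6):
    """Take every `step`-th abstract, skipping any shorter than `min_lines` lines."""
    selected = []
    for index, abstract in enumerate(abstracts):
        if index % step != 0:
            continue  # not one of the every-fourth candidates
        if len(abstract.splitlines()) < min_lines:
            continue  # candidate too short; skipped, not replaced
        selected.append(abstract)
    return selected

# Toy input: nine abstracts of differing line counts, in accession order.
line_counts = [6, 2, 7, 8, 3, 9, 6, 6, 10]
abstracts = ["\n".join(["line"] * n) for n in line_counts]
sample = sample_abstracts(abstracts)
```

Because candidates are taken at a fixed stride through a list ordered by accession date, the sample is spread evenly over time, which is the property the text relies on.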

* Issued on April 30, 1965 to a limited distribution by Janet J. Foster
and Vincent E. Giuliano as Technical Note CACL-15. References
only have been updated.

Details of preparation of the G.E.-2 corpus have been given in Technical
Note CACL-13. The 10,289 abstracts provide roughly 446,000 words of
running text. If a non-function word occurs in singular and plural
forms a total of at least 56 times, it is used as an "association"
index term, and there are 999 such terms. A much larger set of words
and word strings may also be used as auxiliary "coordinate" index terms,
but these are not of concern here.
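The selection rule for association terms can be illustrated with a short sketch. The threshold of 56 comes from the text; everything else is an assumption made for illustration, in particular the crude plural folding (a word ending in S is merged with its S-less form only when that form also occurs) and the function name.

```python
from collections import Counter

def select_association_terms(tokens, function_words, threshold=56):
    """Return non-function words whose singular plus plural count meets the threshold."""
    counts = Counter(t for t in tokens if t not in function_words)
    merged = Counter()
    for word, n in counts.items():
        # Assumed plural rule: WORD + "S" folds into WORD if WORD itself occurs.
        if word.endswith("S") and word[:-1] in counts:
            merged[word[:-1]] += n
        else:
            merged[word] += n
    return {w: n for w, n in merged.items() if n >= threshold}

# Toy token stream standing in for the running text of the corpus.
tokens = ["ALLOY"] * 40 + ["ALLOYS"] * 20 + ["THE"] * 100 + ["CREEP"] * 55
terms = select_association_terms(tokens, function_words={"THE"})
```

In this toy stream only ALLOY qualifies: its singular and plural forms together reach 60, while CREEP falls one short of the threshold and THE is excluded as a function word.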


II.  OVERLAP  OF  VOCABULARIES 


Basically,  the  matter  of  concern  of  this  memorandum  is  the  extent  to 
which  the  two  vocabularies  in  question-- the  UNITERM  vocabulary  of  the 
parent  collection  and  the  machine-selected  G.E.-2  vocabulary--are  alike 
or  different.  The  vocabularies  were  compared  manually  to  the  extent 
that  this  is  possible  without  reference  to  specific  documents,  and 
results  are  summarized  here. 

A.  Inclusion  of  G.  E.  -2  Terms  in  the  UNITERM  Set 


The first question asked was "How many of the G.E.-2 terms are
either identical with or close cognates of terms in the UNITERM set?"

The  999  G.  E.-2  terms  were  classified  into  one  of  the  following  four 
categories : 

C  The  G.  E.-2  term  is  either  a  UNITERM  or  is  the  plural  of  a 
UNITERM. 

X-C  The  G.  E.-2  term  is  not  in  the  UNITERM  list  but  has  the 

same  morphological  stem  as  a  UNITERM  and  is  closely  related 
to  it  in  meaning.  (Examples:  accelerate  and  acceleration; 
grow  and  growth;  capable  and  capability.)  Such  terms  will 
be  referred  to  here  as  morphological  cognates  (listed  in 
Appendix  A) . 

X  The  G.  E.-2  term  is  neither  in  the  UNITERM  list  nor  a  morpho¬ 
logical  cognate  of  a  UNITERM.  (Listed  in  Appendix  B.) 

X-G  Not  otherwise  classified.  There  are  39  unusually  short 

terms  in  the  G.  E.-2  list  which  were  not  classified  because 
of  the  difficulty  of  determining  their  meaning  out  of  context. 
These  terms  contain  between  1  and  3  letters,  and  in  many 
cases  are  abbreviations  of  words  in  the  UNITERM  vocabulary. 
These  terms  could  readily  be  deleted  by  machine,  if  desired, 
on  the  basis  of  their  length.  Percentage  figures  given  in 
the  remainder  of  this  memorandum  exclude  these  39  terms. 
(Appendix  C) 
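The four categories can be mimicked mechanically, although the comparison reported here was made by hand and involved judgments of meaning that no simple rule captures. The sketch below is therefore only a rough approximation: the shared-prefix test for a common stem, its minimum length of four letters, and the function names are all hypothetical.

```python
def shared_prefix_len(a, b):
    """Length of the common leading substring of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def classify(term, uniterms, min_prefix=4):
    """Assign a term to category C, X-C, X or X-G (crude approximation)."""
    if len(term) <= 3:
        return "X-G"  # too short to judge out of context
    if term in uniterms or (term.endswith("s") and term[:-1] in uniterms):
        return "C"    # a UNITERM or the plural of one
    if any(shared_prefix_len(term, u) >= min_prefix for u in uniterms):
        return "X-C"  # assumed stand-in for "same morphological stem"
    return "X"

uniterms = {"accelerate", "grow", "capability", "alloy"}
labels = {t: classify(t, uniterms)
          for t in ["alloys", "acceleration", "growth", "good", "cu"]}
```

A prefix test of this kind will both miss true cognates and admit false ones, which is why the manual classification remains the reference.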
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We  accumulated  information  as  to  both  the  number  of  types  and  number 
of  token  usages  of  words  in  the  categories,  the  results  being: 


TABLE 1


G.E.-2 TERMS INCLUDED IN THE UNITERM VOCABULARY

               Type         % of        Token        % of
Category     Frequency     Types      Frequency     Tokens

C               730         76%        168,710       84%
X-C             165         17%         22,332       11%
X                65          7%          8,958        5%

Note:  Only  type  X-C,  X  and  X-G  tokens  were  counted  in  the 

G.E.-2  collection.  Total  token  usage  was  estimated 
by  multiplying  10,289  (number  abstracts)  times  19.5 
(estimated  number  tokens  per  abstract,  based  on  15 
distinct  term  types  per  abstract  and  recurrence 
factor  of  terms  being  1.3)  giving  a  total  of  about 
200,000  token  usages. 
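The arithmetic behind the note can be laid out explicitly. In the sketch below the rounded total of 200,000 and the counted X-C and X tokens are taken from the text and Table 1; treating the category C figure as the remainder after subtraction is an inference from the fact that the three tabulated token counts sum to exactly 200,000, not something the note states.

```python
ABSTRACTS = 10_289
TYPES_PER_ABSTRACT = 15   # distinct term types per abstract (estimate)
RECURRENCE = 1.3          # average recurrence of a term within an abstract

tokens_per_abstract = TYPES_PER_ABSTRACT * RECURRENCE   # about 19.5
estimated_total = ABSTRACTS * tokens_per_abstract       # about 200,600
ROUNDED_TOTAL = 200_000                                 # figure used in the note

# Token usages actually counted in the G.E.-2 collection.
counted = {"X-C": 22_332, "X": 8_958}

# Assumed: category C tokens are what remains of the rounded total.
c_tokens = ROUNDED_TOTAL - sum(counted.values())

shares = {"C": 100 * c_tokens / ROUNDED_TOTAL,
          "X-C": 100 * counted["X-C"] / ROUNDED_TOTAL}
```

On these assumptions `c_tokens` reproduces the tabulated category C frequency, and the C and X-C shares round to the percentages shown in Table 1.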


It is of interest that such a high percentage of the G.E.-2 terms are
included in the UNITERM vocabulary, particularly considering actual
token usages in indexing. The 84% inclusion suggests that a simple
machine-derived indexing vocabulary need not be very different in formal
makeup than a UNITERM vocabulary consciously selected by indexers.
It is of interest to analyze further the machine-selected G.E.-2 terms
which are not in the UNITERM vocabulary.


III.  ANALYSIS  OF  DIFFERENCES 

A.  Morphological  Cognates 

Most of the UNITERMS are nouns, but many of the morphological cognate
X-C terms in the G.E.-2 collection are not. We broke the list of X-C
cognate words down according to whether the G.E.-2 term is likely to
be primarily a noun, a verb, a verbal (i.e., a participle) or a
modifier. The figures for breakdowns into these categories are:

TABLE 2

MORPHOLOGICAL COGNATES


X-C               Type       % of All      Token       % of All
Sub-Category    Frequency     Types      Frequency      Tokens

Noun                31          3%         3,770         1.9%
Verb                 9          1%         1,202          .6%
Verbal              93         10%        12,565         6.3%
Modifier            32          3%         4,795         2.4%

Total X-C          165         17%        22,332        11.2%

The  set  of  165  type  X-C  words  together  with  the  UNITERM  cognates  are 
exhibited  in  Appendix  A;  also  shown  are  token  frequencies  in  both  the 
G.E.-2  and  the  parent  collection.  In  the  majority  of  cases  the  dif¬ 
ference  between  the  text  term  and  UNITERM  is  primarily  one  of  grammatical 
form  rather  than  meaning.  Detailed  inspection  of  the  lists  lends  very 
strong  credence  to  the  notion  that,  in  the  large  majority  of  cases, 
indexers  used  a  morphological  cognate  which  was  an  acceptable  UNITERM 
in  place  of  the  X-C  text  term  actually  used  in  the  text  of  an  abstract. 
While  this  notion  is  both  plausible  and  supported  by  the  data,  valida¬ 
tion  of  it  will  require  inspection  of  the  detailed  indexing  of  a  sizable 
number  of  abstracts;  this  has  not  been  done  so  far. 

Regardless of how individual abstracts may be indexed, the data of
Appendix A does suggest that a concept mentioned in a document can be
referred to either by a standard UNITERM (in the case of manual indexing)
or by other cognate terms which do appear in text (in the case of
machine indexing). Suppose now that one started with a UNITERM vocabulary
and attempted to do automatic indexing by means of searching for
UNITERMS in text. The data in the above table suggests that at least
11% of the concepts mentioned in the abstracts would not be indexed
using this procedure, for 11% of the occurrences of concept-bearing
text words would be variants of acceptable UNITERMS, and therefore not
recognizable by simple automatic means.


B.  Text  Words  Without  Uniterm  Cognates 

The sixty-five G.E.-2 terms which are neither UNITERMS nor have
UNITERM morphological cognates are listed in Appendix B. They are
broken down according to whether they are noun, modifier, verb, verbal
or adverb, giving the following data.

                  Type       % of All    Total Token     % of All Token
Sub-Category    Frequency     Types       Frequency       Occurrences

Noun                19         1.9%         1,922            1.0%
Verb                 9          .9%         1,748             .9%
Verbal              13         1.3%         2,786            1.4%
Modifier            21         2.1%         2,305            1.2%
Adverb               3          .3%           197             .1%

TOTAL               65         6.5%         8,958            4.6%

Words of this type (X) are of a general nature and can only be of little
value when used singly for retrieving messages from the present
specialized collection. Possibly this is the reason why indexers of
the G.E. collection chose not to make them UNITERMS despite their tendency
to occur frequently in text. However, such terms could conceivably
be of value when combined with other terms having more specific denotations.
In any event, the usefulness of such general terms is at present
unclear, and will require further evaluation.
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APPENDIX  A 

X-C  WORDS 


G.E.-2 

X-C  WORDS 

G.E.-2 

FREQUENCY 

UNITERM 

COGNATE 

PARENT 

COLLECTION 

FREQUENCY 

I .  NOUNS 

acceleration 

98 

accelerate 

(791) 

accuracy  (ies) 

111 

accurate 

(331) 

amplifier  (s) 

95 

amplify 

(609) 

casting  (s) 

105 

cast 

(1023) 

coating  (s) 

425 

coat 

(1790) 

combination  (s) 

152 

combine 

(172) 

conductivity 

143 

conduction 

(1208) 

content  (s) 

142 

contain 

(172) 

correlation  (s) 

167 

correlate 

(220) 

curvature 

57 

curvilinear 

(13) 

deposition 

69 

deposit 

(572) 

derivative  (s) 

62 

derivation 

(1223) 

diffuser  (s) 

106 

diffuse 

(1419) 

efficiency  (ies) 

232 

efficient 

(1141) 

embrittlement 

64 

embrittle 

(318) 

environment  (s) 

115 

environ 

(561) 

feasibility 

59 

feasible 

(68) 

forging  (s) 

99 

forge 

(784) 

formation  (s) 

159 

form 

(662) 

generation 

61 

generate 

(330) 

growth 

73 

grow 

(315) 

injection 

89 

inject 

(1080) 

instability  (ies) 

118 

instable 

(248) 

instrumentation 

91 

instrument 

(2017) 

loading  (s) 

205 

load 

(2567) 

presence 

94 

present 

(31) 

production 

221 

produce 

(455) 

protection 

59 

protect 

(479) 

quantity  (ies) 

73 

quantum 

(45) 
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G.E.-2 

G.E.-2 

UNITERM 

PARENT 

COLLECTION 

X-C  WORDS 

FREQUENCY 

COGNATE 

FREQUENCY 

I.  NOUNS  (Cont.) 

selection 

113 

select 

(184) 

separation 

113 

separate 

(595) 

II.  VERBS 

appear  (s) 

62 

appearance 

(7) 

consist  (s) 

71 

constitution 

(39) 

depend  (s) 

65 

dependent 

(56) 

describe  (s) 

157 

description 

(85) 

determine  (s) 

372 

determination 

(402) 

evaluate  (s) 

76 

evaluation 

(655) 

improve  (s) 

61 

improvement 

(98) 

include  (s) 

271 

inclusion 

(97) 

relate  (s) 

67 

relation 

(166) 

III.  VERBALS 

advanced 

65 

advancement 

(175) 

aging 

ets 

age 

(766) 

analyzed 

112 

analysis 

(4333) 

applied 

274 

application 

(714) 

assumed 

133 

assumption 

(59) 

based 

335 

base 

(793) 

bending 

193 

bend 

(1259) 

boiling 

113 

boil 

(371) 

bonded 

59 

bond 

(725) 

bonding 

71 

bond 

(725) 

calculated 

174 

calculation 

(1966) 

calculating 

89 

calculation 

(1966) 

carried 

72 

carrier 

(110) 

caused 

67 

cause 

(20) 

closed 

87 

close 

(343) 

combined 

103 

combine 

(172) 

compared 

280 

comparison 

(333) 

computed 

66 

computation 

(867) 
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•  G.E.-2 

G.E.-2 

UNITERM 

PARENT 

COLLECTION 

X-C  WORDS 

FREQUENCY 

COGNATE 

FREQUENCY 

III.  VERBALS  (Cont.) 

computing 

65 

computation 

(867) 

conducted 

181 

conduction 

(1208) 

conducting 

83 

conduction 

(1208) 

consisting 

57 

constitution 

(39) 

controlled 

100 

control 

(5202) 

cooled 

137 

cool 

(2318) 

cooling 

318 

cool 

(2318) 

covering 

71 

cover 

(41) 

cracking 

59 

crack 

(1126) 

curved 

63 

curve 

(1104) 

cutting 

94 

cut 

(348) 

damping 

147 

damp 

(816) 

derived 

253 

derivation 

(1223) 

described 

396 

description 

(85) 

designed 

180 

design 

(6403) 

detailed 

73 

detail 

(22) 

determined 

347 

determination 

(402) 

determining 

187 

determination 

(402) 

developed 

404 

deve lop 

(1119) 

elevated 

244 

elevation 

(170) 

employed 

75 

employment 

(29) 

establ ished 

71 

establishment 

(14) 

evaluated 

118 

evaluation 

(655) 

existing 

77 

existence 

(13) 

X-C WORDS        G.E.-2 FREQUENCY   UNITERM COGNATE   G.E.-2 PARENT COLLECTION FREQUENCY

III. VERBALS (Cont.)

extending               73          extension         (127)
fixed                   77          fix               (222)
flowing                 59          flow              (8725)
following               83          follower          (17)
forced                  98          force             (1175)
formed                  71          form              (662)
forming                116          form              (662)
generated               64          generate          (330)
heated                 119          heat              (7899)
heating                224          heat              (7899)
improved               138          improvement       (98)
included               183          inclusion         (97)
including              162          inclusion         (97)
increased              102          increase          (106)
increasing              77          increase          (106)
indicated              106          induction         (555)
limited                 96          limit             (558)
loaded                  67          load              (2567)
manufacturing          '12          manufacture       (660)
measured               199          measure           (581)
melting                131          melt              (816)
mixing                  99          mix               (785)
moving                  59          movement          (148)
observed               101          observe           (10)
operating              229          operate           (489)
past                    89          pass              (36)
performed               88          perform           (1285)
predicted               69          prediction        (305)
prepared                66          preparation       (275)
presented               62          present           (31)
processing             124          process           (1037)
produced               191          produce           (455)
producing               71          produce           (455)
proposed                96          proposal          (125)
reinforced              97          reinforce         (407)
related                148          relation          (166)
relating                62          relation          (166)
reported               107          report            (180)
resulting               68          result            (83)
reviewed                82          review            (305)
selected               111          select            (184)
simulated               57          simulating        (38)
                                    simulation        (430)
solved                  64          solving           (14)
starting                67          start             (426)
studied                231          study             (561)
supported               94          support           (398)
taken                   59          take              (29)
used                   740          use               (79)
using                  572          use               (79)
varying                 85          vary              (429)
welded                  82          weld              (2017)

IV. MODIFIERS

X-C WORDS        G.E.-2 FREQUENCY   UNITERM COGNATE   G.E.-2 PARENT COLLECTION FREQUENCY

analytical             181          analysis          (4333)
annular                 80          annulus           (459)
applicable             123          application       (714)
available              134          availability      (61)
axisymmetric            66          axisymmetry       (290)
basic                  243          base              (793)
capable                 71          capability        (75)
continuous             102          continuation      (224)
cylindrical            198          cylinder          (1406)
digital                 91          digit             (643)
dimensional            431          dimension         (1057)
experimental           881          experiment        (827)
flexural                58          flexure           (266)
gaseous                 78          gas               (5347)
German                  72          Germany           (55)
magnetohydro            74          magneto           (541)
mathematical            86          mathematics       (816)
metallic                91          metalloid         (24)
metallurgical           71          metallurgy        (1074)
operational            102          operation         (887)
optical                 74          optic             (398)
partial                 97          part              (402)
protective              59          protect           (479)
random                  60          randomness        (140)
rectangular            114          rectilinear       (21)
relative               120          relativity        (40)
significant             82          significance      (10)
spherical               88          sphere            (494)
structural             309          structure         (2819)
theoretical            395          theory            (1662)
typical                 88          type              (127)
useful                  76          use               (79)

APPENDIX B

X WORDS

(N = noun; M = modifier; V = verb; VB = verbal; A = adverb)

X WORDS              CLASS   G.E.-2 FREQUENCY

advantage(s)         N       142
agreement            N        88
amount(s)            N        91
arrangement(s)       N        57
associated           VB      133
attempt(s)           N        70
attention            N        62
best                 M        94
better               M        70
cent                 N        76
certain              M       165
columbium            N        99
complete             M       114
consideration        N       183
considered           VB      276
discussed            VB      497
discuss(es)          V       109
discussion(s)        N       294
due                  M       174
encountered          VB       65
expression(s)        N       115
feature(s)           N       106
found                VB      263
further              M        76
give(s)              V       166
given                VB      494
good                 M       126
graphical            M        59
greater              M        75
importance           N        62
important            M        77
involved             VB       80
involving            VB       69
known                M       100
made                 V       518
make(s)              V        70
necessary            M        86
need(s)              N        80
nonlinear            M       141
note(s)              N        68
now                  A        80
obtained             VB      545
obtaining            VB       64
obtain(s)            V       112
occur(s)             V       100
output(s)            N        99
particularly         A        58
particular           M       107
permit(s)            N        89
possibility(ies)     N        72
possible             M       233
previously           A        59
previous             M        71
principal            M        76
provide(s)           V       203
recent               M       103
same                 M       119
satisfactory         M        83
showed               VB       73
showing              VB       65
shown                V       246
show(s)              V       224
subjected            VB      162
subject(s)           N        69
suitable             M       156

APPENDIX C

XG WORDS

XG WORDS   G.E.-2 FREQUENCY

AD           205
AF33          94
AF           249
AL           115
B            102
CO            65
CR           125
C            382
DEG          396
DER           85
DE           163
DES          103
D            201
E             59
F            402
FT            81
H             57
III           88
II           200
J             71
K             92
LA            69
L             68
M           7364
N            102
O             84
PCT          241
P             95
PSI           98
QPR           62
RE            90
R             76
SEC           69
S             86
T             85
VOL           62
V            164
W            124
X            241

Total      12515

VOCABULARY DISTRIBUTION STUDIES


Token  Frequencies  and  Entropy  Calculations  for  the  GE  2  Collection* 

A.  Introduction 

Detailed knowledge of the statistical parameters of the G.E.-2
collection is of considerable importance to the task of extrapolating
results obtained with this collection to others that might be "similar".
In other Technical Notes, the report to the USAF (ESD-TR-66-405) and
elsewhere, a number of such observations and measurements upon the
collection have already been recorded as a byproduct of one
investigation or another. Continuing this process of determining the
values of the collection parameters believed to be important, this note
is primarily concerned with the calculation of the uncertainty
statistics for the word strings that were used in the collection. We
capitalize on the somewhat unusual opportunity offered to us to
determine the values of $H(x), \ldots, H(w, x, y, z)$ for a body of
scientific text of about 500,000 words as an incidental byproduct of
other work.

Because  our  procedures  for  gathering  the  string  statistics  affect 
the  way  in  which  the  uncertainty  measures  may  be  interpreted,  we  first 
review  these  procedures  briefly  in  Section  B.  The  entropy  calculations 
are  then  recorded  for  strings  of  length  1,  2,  3,  and  4  in  Section  C. 

The discussion in Section D is devoted to developing estimates of
conditional entropy for the collection. It provides a partial
description of the corpus using information-theoretic terms. Finally,
we turn attention to the subset of the corpus that was "underlined" by
automatic indexing. The entropy attributable to the words used in
automatic indexing (the G.E.-2A vocabulary) is determined and it is
shown that about 51% of the total entropy is contributed by these
words.


* By Paul E. Jones. Not previously issued.

B. Gathering String Statistics*

1. The Collection

The data base on which the present experimental observations
were made is the GE-2 collection, consisting of 10,287 abstracts
selected from a GE parent collection of approximately 45,000 abstracts.
We excluded abstracts that were "short" (e.g., those which were very
brief) but the procedures for selection were otherwise random,
involving picking every third one. Each abstract consists, for the
present purposes, of a title and the text of the abstract that was
given.

2.  Word  Form  Tokens 

In processing this collection by computer, the text of each
abstract was considered to be a string of word form tokens; i.e.,
basically strings of characters separated by the space symbol. There
were some minor variants to the use of space as a word form token
separator. Specifically, hyphenation in the creation of composite words
was ignored: for example, META-THEORY was regarded as two word forms
META and THEORY in successive positions. On the other hand,
end-of-line hyphenation was recognized, and broken words were glued
together. Naturally, there were instances where end-of-line hyphenation
corresponded to composite word creation, and in these (rare) instances,
the words were (improperly) glued together. There were also many
instances in which, due to transcription errors, this end-of-line
hyphen was not present. In such cases, extra words came to be generated
(e.g., DOCTOR broken across two lines as DOCT- and OR could yield the
word form DOCT if the hyphen were omitted).

By and large, however, the definition of word form token
corresponds accurately to the natural segmentation of text using
spaces. But note for completeness that numbers (like 0.02) count as
word forms and that symbols (like W for Tungsten) do, too. Moreover,
note that the Roman numeral I is not distinguishable from the pronoun
of the same form, etc.


* The programs used for this purpose were prepared and run by Miss
Joyce Mehring. For a detailed description, see Technical Note CACL-13.

The comma (,) and the period (.) punctuation marks are ignored.
Each abstract was therefore "seen" as a string of m tokens, none of
which was a period or comma. All our statistics are based on this view
of the text.
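
The tokenization rules described in this section can be illustrated
with a short Python sketch. It is not the original program (see
Technical Note CACL-13 for that); the function name `tokenize` and the
choice to drop commas and periods only at the edges of a token (so that
a number like 0.02 survives) are assumptions made for illustration.

```python
import re

def tokenize(abstract_text):
    """Split an abstract into word form tokens (illustrative sketch only)."""
    # Glue words broken by end-of-line hyphenation: "DOCT-\nOR" -> "DOCTOR".
    text = re.sub(r"-\s*\n\s*", "", abstract_text)
    tokens = []
    for chunk in text.split():
        # Assumption: commas and periods are ignored at token edges only.
        chunk = chunk.strip(",.")
        # A hyphen inside a composite word separates two word forms.
        tokens.extend(part for part in chunk.split("-") if part)
    return tokens

print(tokenize("META-THEORY of the DOCT-\nOR, 0.02 W."))
```

Note that, as in the text, a composite word that happens to be broken
at the end of a line is glued together by this sketch as well.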

3.  String  Generation  and  Counting 

Based  on  this  view,  string  statistics  were  obtained  by 

a.  "Passing a four-word window" over the m word string for
every abstract and recording each such four-word string
separately on an output tape.

Example:  If the abstract began with the string of word
forms:  a b c d e f g ...
we would generate
a b c d
b c d e
c d e f
d e f g
etc.

One string is generated for each text position at the beginning of
the abstract.  In order to permit this correspondence to continue at the
end of the abstract, we added up to three "dummy" positions (blanks) at
the end.  Thus, if the end of the abstract string were
...  u v w x y z, we would generate:
u v w x
v w x y
w x y z
x y z Δ
y z Δ Δ
z Δ Δ Δ
where Δ denotes a dummy position.

b.  The  set  of  such  four-word  strings  was  alphabetically 
sorted  to  group  recurrences  of  the  initial  substrings  of 
the  four-word  strings.  This  is  the  "Master  Context  List". 

c.  These  initial  substrings  were  counted  to  yield  string 
statistics  as  described  below. 
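
Steps a through c can be sketched in Python as follows. This is an
illustration under stated assumptions, not the program actually used:
the names `four_word_windows`, `master_context_list`,
`count_initial_substrings` and the `DUMMY` marker are hypothetical, and
an in-memory sort stands in for the tape sort.

```python
from collections import Counter

DUMMY = "<dummy>"  # hypothetical marker for a blank (dummy) position

def four_word_windows(tokens, width=4, dummy=DUMMY):
    """Return one width-word string per text position, padded with dummies."""
    padded = list(tokens) + [dummy] * (width - 1)
    return [tuple(padded[i:i + width]) for i in range(len(tokens))]

def master_context_list(abstracts):
    """Sort all windows so recurrences of initial substrings are adjacent."""
    windows = []
    for tokens in abstracts:
        windows.extend(four_word_windows(tokens))
    return sorted(windows)

def count_initial_substrings(context_list, length):
    """Tally the first `length` constituents of every four-word string."""
    return Counter(window[:length] for window in context_list)

# Toy input (not GE-2 data): two very short "abstracts".
context_list = master_context_list([["a", "b", "c"], ["a", "b", "d"]])
pair_counts = count_initial_substrings(context_list, 2)
```

Because every text position yields exactly one window, counting the
initial substrings of length 1 gives the single-word frequencies, while
lengths 2, 3 and 4 include strings that end in one, two or three
dummies.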
C. Frequency Data and Entropy Calculations for Strings

1. Single Word Forms

a.  Frequency  Data  for  Single  Words 

The Master Context List contains a four-word string for
each position of the text; the word form that appeared in a
given position in the text is the first constituent of the
corresponding four-word string. To count frequencies of
single word forms, these "first constituents" in the sorted
Master Context List were tallied. As a result, word "types"
were constructed and the frequency of usage of each was
obtained.

These types (with frequency) were then arranged in frequency
order and a second summary was made by counting the
number of types occurring with each frequency. This process
yielded a report stating, for each distinct frequency that
was observed, the number of types that occurred with that
frequency. For example, there were

1 type with frequency 30170
1 type with frequency 16622
...
1370 types with frequency 3
2929 types with frequency 2
12485 types with frequency 1

Using these figures, it was possible to count the total
number of tokens in the text, by accumulating the total
number of tokens accounted for by each line of the above
summary. Thus, the first line accounts for 30170 tokens,
the next to last line accounts for 2 x 2929 tokens, and the
last accounts for 12,485 tokens. The grand total was 446,097
tokens, and this is the total number of text positions.
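
The summary report and the token total can be sketched as below. The
function names are hypothetical and the sentence used is a toy example,
not GE-2 data; only the last line reuses two summary lines quoted
above.

```python
from collections import Counter

def frequency_spectrum(type_counts):
    """Map each observed frequency f to the number of types occurring f times."""
    return Counter(type_counts.values())

def total_tokens(spectrum):
    """Accumulate f * (number of types with frequency f) over all summary lines."""
    return sum(f * n for f, n in spectrum.items())

# Toy example: "the" occurs three times, four other types occur once each.
type_counts = Counter("the cat saw the dog the end".split())
spectrum = frequency_spectrum(type_counts)
toy_total = total_tokens(spectrum)

# Tokens accounted for by the last two lines of the summary quoted above.
last_two_lines = total_tokens({2: 2929, 1: 12485})
```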

There were 23,505 different types (i.e., distinct word
forms) recognized by the formal procedures described in
Section B. Thus, this set of types includes numerals,
misspellings, and parts of improperly broken words; i.e., the
word forms are taken exactly as they appeared in the text.
This set of 23,505 types is taken as the symbol vocabulary of
the collection.

b. Entropy Calculations for Single Words

The  entropy  H(x)  of  this  symbol  vocabulary  was  computed 
based  on  the  following  procedure: 

(1) From each line of the summary, we know a distinct
frequency, $f_i$, for some set of types. A type of this
frequency contributes

$$h_i = -\frac{f_i}{446{,}097} \log_2 \frac{f_i}{446{,}097}$$

to the total entropy H(x).

(2) If there are $n_i$ types that occurred with that
frequency $f_i$, in aggregate they contribute $n_i h_i$ to the
total entropy H(x).

(3) The total entropy H(x) was obtained by cumulating
the $n_i h_i$ for the lines of the summary report. The
calculated value was:

$H(x) = 10.36 \pm 0.01$ (estimated error) bits per word.

The estimated error is probably larger than it should be.
A more detailed analysis would take note of the fact that
10 digits were used in the mantissa of floating-point
calculation in 1401 FORTRAN 2 and that the major source of error
is probably the LOG routine provided in the system library,
whose error properties are not known to us at this time.
Any systematic error in that routine (e.g., a value slightly
too low) would be accumulated 23,505 times by this process.
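
The three steps amount to a single weighted sum over the lines of the
summary report. A minimal Python sketch follows; the function name
`entropy_from_spectrum` is hypothetical, and it uses double-precision
`math.log2` rather than the 1401 FORTRAN LOG routine discussed above.

```python
import math

def entropy_from_spectrum(spectrum, total_tokens):
    """Sum n_i * h_i over summary lines, with h_i = -(f_i/N) * log2(f_i/N).

    `spectrum` maps each frequency f_i to the number n_i of types with
    that frequency; `total_tokens` is N, the number of text positions.
    """
    entropy = 0.0
    for frequency, n_types in spectrum.items():
        p = frequency / total_tokens
        entropy += n_types * (-p * math.log2(p))
    return entropy

# Toy check: four equally likely types carry 2 bits per symbol.
toy_entropy = entropy_from_spectrum({1: 4}, 4)
```

Applied to the full GE-2 summary (not reproduced here) with
total_tokens = 446,097, the same sum would be expected to give the
value of H(x) reported above.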

2.  Two  Word  Strings 

a.  Frequency  Data  for  Two-Word  Strings 

Almost exactly the same procedure was followed with two-word
strings as that used for single words. However, because
there were so many one-occurrence two-word strings, these
were not actually counted. Rather, all the two-word strings
that occurred more than once were counted. As before, types
were produced, arranged by frequency and summarized. The
summary text was of the same form; e.g.,

1 type with frequency 2331
1 type with frequency 1469
...
8639 types with frequency 3
15877 types with frequency 2
166,911 types with frequency 1.

b.  Entropy  Calculation  for  Two-word  Strings 

The entropy of the pair vocabulary is easily calculated
using the same procedure as before. That is, we would readily
regard any pair that occurs with frequency $f_i$ to contribute

$$h_i = -\frac{f_i}{N} \log_2 \frac{f_i}{N}$$

(where N is the total number of pairs) to the entropy Shannon
calls H(x,y). Had our entire collection of abstracts been
regarded as a single long string, this would indeed be the
correct procedure. But in the procedure we followed for moving
a window over the text, we introduced a dummy (blank) at the
end of each message and regarded the collection as composed of
10,287 shorter units. It follows that we have generated pair
types of the form "zΔ" (a word z followed by the dummy Δ) that
in aggregate occur 10,287 times. Some of them probably occur
quite frequently (e.g., 30 times) reflecting some propensity
to end abstracts with the word z. We do not know which
types are of this "terminal" kind; accordingly, we must be
careful in interpreting the value of H(x,y) obtained from
the formula.

We regard our symbol source as something which emits a
special terminating symbol (dummy) at the end of every message.
To identify this notion of source (in contrast to the case of
single words where there was no need to consider the dummy
symbol) we call the pair entropy $H^{1D}(x,y)$ to note the fact
that one dummy is emitted per message.

The calculation of $H^{1D}(x,y)$ is performed by computing
the probability $p^{1D}(xy)$ of each pair type using

$$p^{1D}(xy) = \frac{f^{1D}(xy)}{446{,}097}$$

where $f^{1D}(xy)$ is the pair frequency given in the summary
list and 446,097 is the total number of such pairs in our sample
of symbol strings emitted by the 1D source.

The value of $H^{1D}(x,y)$ was calculated for the collection
using the same procedure previously described. The result was

$$H^{1D}(x,y) = 16.26 \pm .01 \text{ bits per digram}$$


3.  Three-Word  Strings 

a.  Frequency  Data  for  Three-word  Strings 

The same procedure was used for tallying three-word
frequencies as that described for pairs, except that strings with
frequency 2 and 1 were not counted. Nor was it possible to
deduce the exact numbers of strings with these frequencies as
was possible in the case of two-word strings.

It was possible, however, to calculate bounds on the
entropy by considering the extreme ways the remaining
(446,097 - 64,412) = 381,687 residual three-word strings
could be distributed over 2-occurrence and 1-occurrence types.
The two extreme cases are shown below.

Case A: Suppose the three-word strings in question are
all at frequency 1, i.e.,

0 types with frequency 2
381,687 types with frequency 1
total types = 381,687

Case B: Suppose as many as possible of the three-word strings
occur with frequency 2 (and as few as possible with frequency 1).
Because there are 166,911 two-word types AB that occur only
once, we know there are at least 166,911 triplet types ABC that
also occur only once. Thus Case B results in

107,388 types with frequency 2
166,911 types with frequency 1

b. Entropy Estimate for Three-word Strings

Using the same procedure previously described under the
discussion for two-word strings, we obtained the following
values of $H^{2D}(x,y,z)$ for the assumed extreme distributions
of low frequency types.

$H^{2D}(x,y,z) = 18.34$ (all residual strings taken as f=1)

$H^{2D}(x,y,z) = 17.86$ (as many residual strings taken to
have f=2 as possible)

Thus, $H^{2D}(x,y,z) = 18.10 \pm 0.24$

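
The bookkeeping behind the two extreme cases, and the way the final
figure is formed from them, can be checked with a short Python sketch.
The function names are hypothetical; the numbers are those quoted in
the text.

```python
def residual_cases(residual_tokens, forced_singletons):
    """Return the two extreme frequency summaries for the uncounted strings.

    Case A puts every residual token in its own type (frequency 1).
    Case B keeps only the singletons that are forced to exist and pairs
    up the remaining tokens into types of frequency 2.
    """
    case_a = {1: residual_tokens, 2: 0}
    doubled_types = (residual_tokens - forced_singletons) // 2
    case_b = {1: forced_singletons, 2: doubled_types}
    return case_a, case_b

def midpoint_with_halfwidth(upper, lower):
    """Quote two bounds as (midpoint, half the distance between them)."""
    return (upper + lower) / 2, (upper - lower) / 2

# Three-word strings: 381,687 residual tokens, 166,911 forced singletons.
case_a, case_b = residual_cases(381_687, 166_911)
estimate, error = midpoint_with_halfwidth(18.34, 17.86)
```

Here case_b reproduces the 107,388 types with frequency 2 of Case B,
and the midpoint and half-width reproduce the quoted 18.10 and 0.24.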
4.  Four-Word  Strings 

a.  Frequency  Data  for  Four-word  Strings 

The situation with four-word strings is exactly like that
for three-word strings. The number of types that occurred
with frequency $\le 2$ was not recorded. The residual
(446,097 - 14,855) = 431,242 four-word strings could be
distributed in the same extreme ways as before:

Case A: Assume all the types have frequency 1
431,242 types with frequency 1
0 types with frequency 2

Case B: Assume as many as possible of the residual types
have frequency 2. Arguing as before, we know there are
at least 166,911 triplets with frequency 1 and hence
at least that many quadruplets. Thus the extreme case is
166,911 types with frequency 1
132,165 types with frequency 2

b. Entropy Estimate for Four-word Strings

The entropy calculations were carried out for both cases
as usual. The contribution due to all the quadruplets with
$f \ge 3$ is, we note, only 0.54.

Case A: $H^{3D}(w,x,y,z) = 18.7$

Case B: $H^{3D}(w,x,y,z) = 18.1$

Thus, $H^{3D}(w,x,y,z) = 18.4 \pm .4$ bits per quadruplet

D.  Conditional  Entropy  Estimates 

1. General

The ordinary procedure for determining conditional entropy
cannot be applied at once to the data we have gathered. As pointed
out in Section C-2b, we are not treating the text as one long
fragment but as 10,287 short pieces. We have introduced "dummy"
symbols between the messages, and the presence of these dummies
is a complication that must be dealt with. To exhibit the
difficulty most clearly, we need only notice that $H^{1D}(x,y)$ was
obtained from a source among whose symbols a dummy was present,
whereas $H(x)$ was obtained from a source that produced no dummies.
Thus, even though the entropy values calculated in Section C were
all based on the same corpus, we (in a limited sense) treated it as
four distinct sources in the information-theoretic sense.

The formulas for conditional entropy, e.g.,

$$H(x,y) - H(x) = H_x(y)$$

cannot be used without giving some attention to this point.

2.  Approximations  and  Estimates 

Suppose we treat the source as a 1D# source, i.e., the corpus
consists of the texts of the messages with a single dummy space
between each. We need to adjust $H^{1D}(x,y)$ and $H(x)$ to
correspond to this source in order to find $H_x^{1D\#}(y)$. The
value of $H^{1D\#}(x)$, however, is very closely related to $H(x)$;
one more symbol (the dummy) has been added with frequency 10,287 and
the total length of the sample has been increased by 10,287. The
addition of the new symbol raises H by

$$-\frac{10{,}287}{446{,}097} \log_2 \frac{10{,}287}{446{,}097} = -.0229 \log_2 .0229 = .12$$

The increased length of the sample changes all the $p_i$ from
expressions of the form

$$\frac{f_i}{446{,}097}$$

to expressions of the form

$$\frac{f_i}{446{,}097 + 10{,}287}$$

Accordingly, all the $p_i \log p_i$ aggregated to form $H(x)$ are
slightly reduced in forming $H^{1D\#}(x)$. Since the two effects tend
to balance out to some extent we can estimate

$$H^{1D\#}(x) = (H(x) + .05) \pm .05$$

The same kind of argument needs also to be applied to adjust
the computed value of $H^{1D}(x,y)$. In tallying the strings, we did
not count 10,287 pairs of the form <dummy, first word of message>.
Also, we are now considering that the text includes the dummies,
so that the $p_i$ are all reduced.

Again, the effects are counteracting, and we deal with it by
increasing the error bounds on $H^{1D}(x,y)$:

$$H^{1D\#}(x,y) = H^{1D}(x,y) \pm .05$$

We can now compute the conditional entropy for the 1D# source.

$$H_x^{1D\#}(y) = H^{1D\#}(x,y) - H^{1D\#}(x) = 16.26 \pm .06 - (10.36 \pm .01 + .05 \pm .05)$$

$$H_x^{1D\#}(y) = 5.95 \pm .12$$


The foregoing discussion outlines the factors that must be considered
before the uncertainty figures reported here can be compared and
contrasted with those for other collections used as sensitive
parameters in theoretical work. It is clear, however, that the
various ways of interpreting the nature of the symbol source cause
only minor fluctuations in the entropy values. For many informal
purposes, these variations can be ignored.

With this view in mind, we estimate

$$H_y(z) = 5.9 \pm .2 \text{ bits}$$

$$H_{xy}(z) = 18.1 \pm .3 - (16.3 \pm .1) = 1.8 \pm .4 \text{ bits}$$

The cumulative errors are too great to permit us to estimate
$H_{wxy}(z)$.

3. Contribution to H(x) of GE-2A Automatic Indexing Vocabulary

The subset of 1434 word forms making up the GE-2A vocabulary
was identified and their contribution to H(x) was calculated using
the formula

$$-\frac{f_i}{446{,}097} \log_2 \frac{f_i}{446{,}097}$$

The value obtained was $5.25 \pm .01$ bits for a total of 219,913
tokens or 49.30% of the text. Since $H(x) = 10.36 \pm .01$ bits, 51%
of the total is contributed by our GE-2A vocabulary.
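
The two percentages quoted above follow directly from the reported
figures, as this small Python check shows. The variable names are
hypothetical; the numbers are taken from the text.

```python
TOTAL_TOKENS = 446_097        # text positions in GE-2
GE2A_TOKENS = 219_913         # tokens belonging to the GE-2A vocabulary
GE2A_ENTROPY_BITS = 5.25      # contribution of GE-2A word forms to H(x)
TOTAL_ENTROPY_BITS = 10.36    # H(x) for the whole symbol vocabulary

token_share = GE2A_TOKENS / TOTAL_TOKENS
entropy_share = GE2A_ENTROPY_BITS / TOTAL_ENTROPY_BITS

print(f"{token_share:.2%} of the text, {entropy_share:.0%} of H(x)")
```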
ZIPF'S LAW AND HERDAN'S LAW OF SOLIDARITY *


A.  Introduction 

Zipf's Law for vocabulary distribution states that the rank of a
word type times its frequency of usage in a sample of text is
(approximately) a constant. Herdan** arranges the same data differently
and claims that the number of word types occurring with frequency n is
predicted by the Waring distribution. (See TN CACL-30.) He points, with
especial interest, at the gradient of the curve, expressed as the ratio
of the number $h_{n+1}$ of types occurring with frequency n+1 to the
number $h_n$ occurring with frequency n. This ratio is given by the
Waring distribution to be

$$\frac{h_{n+1}}{h_n} = \frac{a + n - 2}{x + n - 1} \qquad (1)$$

where a and x are constant parameters of the distribution. Herdan
considers this feature of the Waring distribution (the regularity of
the ratio of successive terms) to be quite significant (p. 89):

    This relation between successive terms of the series
    remains unaltered (invariant) despite the change in
    numerical values with sample size. The gradient is
    thus an invariant, epitomizing the system of
    solidarity among the vocabulary items.

On page 91 he writes

    This means that the accumulation of vocabulary items
    in the various classes of the variable follows some
    sort of solidarity mechanism. Such a mechanism is
    implied by the Waring distribution, representing a
    gradient of probabilities, each of which stands in a
    definite relation to the preceding one, which
    constitutes what we have called the invariance of the
    vocabulary distribution function.

And later

    "Considering that the latter (gradient of the
    frequencies in the successive class intervals of the
    variable described by the Waring distribution) is
    brought about by the solidarity of the system of
    vocabulary, I propose to call it a Law of Solidarity..."


* Issued on May 10, 1966 to a limited distribution as a Supplement to
Technical Note CACL-30 by Paul E. Jones, Jr.

** Herdan, G., Quantitative Linguistics, Butterworths, London, 1964.

This note is devoted to demonstrating that a corpus which
satisfies Zipf's Law has a "Law of Solidarity" of almost identical
form. Moreover, when this result is coupled with the empirical result
presented below (which showed that a Zipf plot of the Herdan-Waring
distribution yields a straight line for the data we studied), the
conjecture that the Herdan-Waring distribution is very similar to
Zipf's is reinforced.


B.  The  Corresponding  Gradient  In  a  Zipf  Distribution 

Consider a sample of text containing a total of T distinct types, of
which $h_1$ occur exactly once. Let $h_n$ in general denote the number
of types that occur with frequency n. Let $r_n$ designate the "rank" of
those types that occur with frequency n. (Since there may be many types
occurring with frequency n, we adopt the convention that by "rank"
$r_n$ we mean the largest rank assigned to the set of words with
frequency n. On a Zipf plot, then, we consider the right hand edge of
the descending steps to be the point which defines "rank".)

Consider a sample in which Zipf's Law holds, using the foregoing
definition of "rank" for types with the same frequency.

Because we know there are T types in all, we know that $r_1 = T$, that
$r_2 = T - h_1$, and in general that

$$r_{n+1} = r_n - h_n$$

But if Zipf's Law holds, then

$$n\, r_n = \text{const.} = (n+1)\, r_{n+1} = (n+1)\, r_n - (n+1)\, h_n$$

$$n\, r_n = n\, r_n + r_n - (n+1)\, h_n$$

$$h_n = \frac{1}{n+1}\, r_n$$

Similarly,

$$h_{n+1} = \frac{1}{n+2}\, (r_n - h_n)$$

Thus, substituting $r_n = (n+1)\, h_n$,

$$\frac{h_{n+1}}{h_n} = \frac{(n+1)\, h_n - h_n}{(n+2)\, h_n} = \frac{n}{n+2}$$

which is a Law of Solidarity for the Zipf distribution comparable to
Herdan's version (1).

Obviously, setting a=2 and x=3 makes the two gradients identical. It
would be well to look next at whether a and x can depart significantly
from these values in Herdan's development.
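
The result can be checked with exact rational arithmetic. The Python
sketch below idealizes Zipf's Law as n * r_n = const. holding exactly
(so ranks need not be whole numbers) and takes the Waring gradient in
the form (a + n - 2)/(x + n - 1) as read from equation (1); all
function names are hypothetical.

```python
from fractions import Fraction

def zipf_rank(n, const):
    """Rank r_n of the last type with frequency n when n * r_n = const."""
    return Fraction(const, n)

def zipf_types(n, const):
    """h_n = r_n - r_{n+1}, the number of types with frequency n."""
    return zipf_rank(n, const) - zipf_rank(n + 1, const)

def zipf_gradient(n, const=1):
    """Ratio h_{n+1} / h_n under the idealized Zipf distribution."""
    return zipf_types(n + 1, const) / zipf_types(n, const)

def waring_gradient(n, a, x):
    """Gradient in the form of equation (1): (a + n - 2) / (x + n - 1)."""
    return Fraction(a + n - 2, x + n - 1)

# The Zipf gradient equals n / (n + 2), i.e. the Waring form with a=2, x=3.
gradients_agree = all(
    zipf_gradient(n) == Fraction(n, n + 2) == waring_gradient(n, 2, 3)
    for n in range(1, 51)
)
```

The constant cancels in the ratio, which is why the gradient does not
depend on the size of the sample in this idealization.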

C. Conclusion

Herdan  may  be  reading  too  much  into  the  significance  of  the  Waring 
distribution's  regular  gradient.  I  find  it  difficult  to  see  what  he  is 
so  excited  about  in  view  of  the  above  result. 

Work  continues  in  an  attempt  to  place  the  Waring  distribution  in  a 
form  that  makes  it  apparent  what  the  corresponding  rank-frequency  plot 
looks  like. 




Section  III:  CACL-30 


FITTING  THE  HERDAN-WARING  DISTRIBUTION 
TO  THE  VOCABULARY  USAGE  DISTRIBUTION 
IN  THE  GE-2  CORPUS 


A.  Introduction 


Herdan is critical of the Zipf model for vocabulary distribution. He
prefers rather to work with the number $t_n$ of types that occur with
frequency n in the text. For the GE-2 collection we have such data
in a table

Number of Types    Frequency
        1            30170
        1            16222
      ...              ...
     2929                2
    12485                1

Herdan  suggests  that  the  Waring  distribution  is  an  appropriate  function 
to  fit  to  this  vocabulary  distribution.  This  note  represents  a  test  of 
Herdan's  claim  as  applied  to  the  GE-2  message  collection. 

Issued on May 9, 1966 to a limited distribution as Technical Note
CACL-30 by Peter R. Bono and Paul E. Jones, Jr.

Herdan, G., Quantitative Linguistics, Butterworths, London, 1964.
(See especially p. 85 ff.)

Waring's expansion for 1/(x-a) is reported by Herdan to go back to the
18th century. This expansion is written (we have corrected the
misprints):

$$\frac{1}{x-a} = \frac{1}{x} + \frac{a}{x(x+1)} + \frac{a(a+1)}{x(x+1)(x+2)} + \cdots$$

and is convergent for x > a > 0. If we multiply by (x-a), the series
on the right sums to unity, and Herdan asserts that the n-th term may be
interpreted as $p_n$, the fraction of the total vocabulary T (i.e., the
proportion of all the types in the sample) that are found occurring in
the text with frequency n.
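
As an illustration, the terms of the series multiplied by (x-a) can be
generated with exact rational arithmetic. The sketch below is not
Herdan's procedure; the function name `waring_terms` is hypothetical,
and integer values of x and a are assumed only so that fractions stay
exact.

```python
from fractions import Fraction

def waring_terms(x, a, count):
    """First `count` terms of (x - a) times Waring's expansion of 1/(x - a).

    The first term is (x - a)/x; each later term is the previous one
    multiplied by (a + n - 1)/(x + n), as in the series written above.
    """
    terms = []
    term = Fraction(x - a, x)
    for n in range(1, count + 1):
        terms.append(term)
        term *= Fraction(a + n - 1, x + n)
    return terms

# Example with x = 3, a = 2: the terms are 1/3, 1/6, 1/10, ...
first_terms = waring_terms(3, 2, 3)
partial_sum = sum(waring_terms(3, 2, 98))
```

For x = 3 and a = 2 the partial sum after K terms is 1 - 2/(K+2), so
the sum approaches unity, but slowly; this is worth keeping in mind
when only the first 50 terms are tabulated.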

Given the correct choices of x and a, Herdan claims that the
Waring distribution should fit observed data, like that in the table
above that shows the number of types with frequency of occurrence = n.

The key to fitting the Waring distribution to the sample data is to
choose values of x and a. For this purpose Herdan exhibits the
following procedure.

B. Procedure

1. Notation

Let $\bar{x}$ be the average frequency of occurrence of a word; i.e.,

$$\bar{x} = \frac{N}{T}$$

where N    =  number of running words (446097 in GE-2)
      T    =  number of different types (23505 in GE-2)
      p_1  =  fraction of vocabulary of types which is
              accounted for by words which occur only
              once (i.e., by 12485 types in GE-2)
and   p_n  =  fraction of types occurring with frequency n.

2. Procedure

Herdan provides tables which can be entered using $\bar{x}$ and $p_1$
above. These tables provide the value $p_n$ (n = 2, 3, ..., 50). So
they yield the expected number of types with frequency n if we multiply
$p_n \cdot T$.

3. Experiment at fitting

We first attempted to follow Herdan's procedure using our data
and to determine the fit between predicted and observed values. For the
GE-2 collection we calculate

$$\bar{x} = \frac{446097}{23505} = 18.9788 \quad \text{(mean frequency)}$$

$$p_1 = \frac{12485}{23505} = .53116 \quad \text{(fraction of words with } f = 1\text{)}$$

and the desired predicted values of $p_i$ were obtained from
interpolation in Herdan's table on pages 267-8.
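
The two entry values for Herdan's tables can be recomputed directly
from the collection counts. In this small Python sketch the variable
names are hypothetical, and `expected_types` only illustrates the final
multiplication by T; the tabulated values of p_n themselves are not
reproduced here.

```python
RUNNING_WORDS = 446_097     # N, tokens in GE-2
DISTINCT_TYPES = 23_505     # T, types in GE-2
SINGLETON_TYPES = 12_485    # types occurring exactly once

mean_frequency = RUNNING_WORDS / DISTINCT_TYPES
p1 = SINGLETON_TYPES / DISTINCT_TYPES

def expected_types(p_n, total_types=DISTINCT_TYPES):
    """Expected number of types with frequency n, given a tabulated p_n."""
    return p_n * total_types

print(round(mean_frequency, 4), round(p1, 5))
```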


The product $p_i \cdot T$ and the observed number of types with
frequency i are plotted against i (x-axis) in Figure 1. The figure
is cut apart for ease of printing.

The fit was considerably better than had been expected. The two curves
have very much the same shape. Because of the fitting procedure, the
two curves come together (off-scale) at frequency = 1. On the other
hand, the differences between predicted and observed numbers of types
are substantial for low frequencies > 1. This causes such large
contributions to $\chi^2$ that we did not consider it worthwhile to
complete the details of the test.


In  view  of  the  fitting  procedure's  dependence  on  the  number  of  types  that 
occur  with  frequency  1,  we  recalled  that  many  of  these  types  counted  by 
the  machine  were  actually  misspelled  words.  Estimates  of  the  number  of 
types  in  the  frequency  1,  2,..., 10  ranges  which  were  occurrences  or 
repeated  occurrences  of  misspelled  words  had  previously  been  obtained 
(Technical  Note  CACL-13,  Supplement).  We  wondered  whether  the  blind 
counting we had employed might not be the reason for the inexact fit in
Figure  1. 

Accordingly, we -- in effect -- threw out all instances of misspelled
words from the running text of the GE-2 corpus and treated the remaining
text as if it had been the original sample. This was accomplished by

(a) subtracting, from the observed number of types with frequency i,
the estimated number of misspelled types with that frequency
(for i = 1,...,10). The resulting new values were


Frequency i   Observed          Estimated %   Types        Corrected
              Number of Types   Misspelled    Misspelled   Number

     1            12485            42.5%         5306         7179
     2             2929            16.0%          469         2460
     3             1370             7.3%          100         1270
     4              824             4.5%           37          787
     5              631             3.0%           19          612
     6              422             2.3%           10          412
     7              346             1.8%            6          340
     8              310             1.4%            5          305
     9              268             1.2%            3          265
    10              221             1.0%            2          219

                                   Total         5957


(b)  The  total  number  of  types  T  was  reduced  by  the  number  of  types 
misspelled  to  yield  23505  -  5957  =  17548. 

(c)  The  total  number  of  tokens  in  the  text  was  reduced  by  the  number 
of  misspelled  words  "thrown  out".  (See  Technical  Note  CACL-13, 
Supplement,  last  page  for  determining  the  estimate).  The  total  = 
6976  words  misspelled;  hence  439121  is  the  reduced  number  of  tokens 
in  the  text. 
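
The arithmetic behind steps (a) through (c) can be sketched as follows. This is only an illustration using the counts quoted in the text; the function name `fit_parameters` is an assumption of this sketch, not something taken from the small program mentioned in the note, and it computes only the two quantities used to enter Herdan's tables, not the Waring distribution itself.

```python
# Counts quoted in the text for the GE-2 collection.
N_TOKENS = 446097           # running words
N_TYPES = 23505             # different types
N_ONCE = 12485              # types occurring exactly once
MISSPELLED_TYPES = 5957     # total of the "Types Misspelled" column
MISSPELLED_TOKENS = 6976    # running words judged to be misspellings
CORRECTED_ONCE = 7179       # corrected number of types with frequency 1


def fit_parameters(tokens, types, once):
    """Return (mean frequency, fraction of types occurring once)."""
    return tokens / types, once / types


# Parameters for the collection as counted by the machine.
original = fit_parameters(N_TOKENS, N_TYPES, N_ONCE)

# Parameters after "throwing out" the estimated misspellings.
reduced_tokens = N_TOKENS - MISSPELLED_TOKENS
reduced_types = N_TYPES - MISSPELLED_TYPES
reduced = fit_parameters(reduced_tokens, reduced_types, CORRECTED_ONCE)

print("original: mean frequency %.4f, p_1 %.5f" % original)
print("reduced:  mean frequency %.4f, p_1 %.5f" % reduced)
```

The reduced parameters differ noticeably from the original ones, which is consistent with the remark that Herdan's published tables did not cover the interval into which the new numbers fell.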

Because Herdan's tables did not cover the interval into which these
numbers fell, a small program was prepared to compute the Waring
distribution. We checked the previous set of table values used in
developing Figure 1, and checked Herdan's computation of the distribution
on Page 87. Both were correct within the limits of accuracy of the
published tables.

Figure 2 shows the plot for the reduced GE-2 collection with misspellings
"thrown out". A slight improvement is observed.

C.  Conclusions 


1.  The  Herdan-Waring  distribution  reflects  with  moderate  accuracy  the 
observed  distribution  of  the  number  of  types  which  have  occurrence 
frequency  n  . 

2.  The exclusion of misspelled words from the text in an effort to
improve the fit resulted in a modest improvement, but this improvement
is less interesting than the accuracy of the fit achieved without any
intervention.

3.  It would be valuable to determine whether this distribution, when
converted into a predictor of the rank-frequency curve, would account for
or give clues about the bow in the Zipf curve. This conjecture will be
treated further in subsequent notes. Preliminary evidence suggests that
the Herdan-Waring distribution for a sample produces a vocabulary
distribution that closely resembles Zipf's, at least at the right hand
end of the rank-frequency plot. Figure 3 shows the results of plotting
the Herdan-Waring values in Zipf form for the two sets of text parameters
we ran through the program. Curve A is for the GE-2 collection discussed
in this note. Curve B is for the sample of Pushkin to which Herdan
applies his procedures (Page 87) and which we used to check the program.

A  straight  line  of  slope  -1  is  included  as  usual  to  aid  the  eye. 
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Section  III: 


Supplement  to  CACL-13 


RANK VS. FREQUENCY PLOTS FOR 1, 2, 3, AND 4

WORD STRINGS IN G.E.-2 CORPUS *


This  supplement  contains  the  Rank-Frequency  plots  for  the  context  strings 
counted  in  the  processing  of  the  G.E.-2  data.  Several  supplementary 
programs,  prepared  and  run  by  J.  Mehring,  were  used  to  obtain  these  counts. 

The  four  plots  are  attached.  The  reader  will  note  that  the  origin  of  the 
coordinate  system  for  the  first  plot  (single  words)  is  different  from  that 
used  for  the  other  three  figures. 


Observations 


1.  The  plot  for  one-word  contexts  (i.e.,  word  forms)  does  not  show 
the  straight  line  with  slope  -1  appropriate  to  the  Zipf  curve. 

It  has  a  distinct  curvature  which  presumably  can  be  attributed 
to  the  fragmented  nature  of  the  corpus  and  the  heterogeneity  of 
the  subjects  treated. 

2.  The plots for the context strings of various lengths all show
rather straight lines with slope in the vicinity of -1/2. I think
the straightness is interesting.

This  supplement  is  being  distributed  for  general  interest.  So  far  as  I 
know,  these  are  the  first  Zipf  plots  of  context  strings  ever  prepared,  so 
providing  a  good  explanation  of  these  data  would  be  more  than  sheer  re¬ 
creation. 


Reference 


Zipf,  G.  K.,  Human  Behavior  and  the  Principle  of  Least  Effort,  Addison- 
Wesley,  Cambridge,  1949. 


*  Issued  on  May  5,  1965  to  a  limited  distribution  as  Supplement  to 
Technical  Note  CACL-13  by  Paul  E.  Jones,  Jr. 
Rank-Frequency Curve for Types Discovered in GE Collection
of approximately 446,000 Running Words in 10,287 Fragments
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ZIPF  CURVES  FOR  GE  AND  NASA  INDEXING  VOCABULARIES* 

We have repeatedly observed that the large manual indexing
vocabularies we have studied do not produce a straight-line Zipf curve.
They characteristically are bowed. This note records the rank-frequency
plots for two manual indexing vocabularies, the GE-0 vocabulary and
NASA's.


A.  Zipf  Curve  for  GE-0  Manual  Indexing  Vocabulary 

Figure 1 accurately shows the rank-frequency curve for the
frequent GE terms, where the frequency plotted is a term's total
postings in the 69,668-document GE-0 collection. Figures were
obtained from the vocabulary listing dated November, 1962.

1 .  Data 

All terms with frequency ≥ 100 are plotted and most, but
not all, of the lower frequency terms are also plotted. This
is because we had only a partial deck of terms (with
frequencies) in keypunched form. To complete the deck, the missing
terms were identified, their frequencies looked up, and a
new card containing the missing frequency was prepared (for
frequencies > 100).

A total of 1,645 terms with frequency < 100 are omitted
from the plot. The dashed line shows how they would probably
fit in. (Because there are 4,824 terms in the vocabulary,
the curve has to reach frequency 1 at rank 4824.)

B.  Zipf  Curves  for  the  NASA  Vocabulary 

Figure 2 shows the usual rank-frequency plot of the whole
18,292-term NASA indexing vocabulary, reflecting postings to the
subcollection of about 100,000 documents we have studied. The plot
shows 10,083 postings to the term of rank 1 and drops to 0 postings
at about rank 16,000. This curve is definitely non-Zipfian,
exhibiting a strong bow upwards.

C.  The  Division  of  the  NASA  Vocabulary  into  Two  Separate  Vocabularies 

Due to the high density of multiple word terms (MWT) near the
low frequency end of the NASA vocabulary list, it was decided to
actually tabulate the distribution of the MWT's over the whole
vocabulary. The number of MWT's on each page of NASA's
frequency-ordered index term dictionary was sampled and the proportions
were used to obtain separate rank vs. frequency plots for the MWT
subvocabulary and the Single Word Term subvocabulary as shown in
Figures 3 and 4 respectively. Again, each of these curves is seen to
be decidedly non-Zipfian.

*  Issued  on  May  6,  1966  to  a  limited  distribution  by  Peter  R.  Bono  and 
Paul  E.  Jones,  Jr.  as  Technical  Note  CACL-29. 
Zipf Curve for GE-0 Indexing Vocabulary
(Accurate for f ≥ 100; Approximate for f < 100)

Rank Vs. Frequency for the Multiple-Word Terms only in the NASA 18 K Index Collection
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SECTION  IV 


STUDIES  OF  CONTENT  BEARING  UNITS  IN  TEXT 


Selecting  Content  Bearing  Units* 


In order to understand the work on overlap described in the
next paper, it is necessary to present a few details concerning the
processing of the data, in addition to that in CACL-13 in Section I.
The GE-2 abstract collection was processed to reveal all two-word
strings, the frequencies of the 2 terms involved, and the frequency
of the string itself. This processing has been described in Section I.
Of those two-word strings, a content-bearing unit was defined as a
string AB which occurred fab times, consisting of two terms with
frequencies fa and fb, if the following conditions were met:


     1)  fab ≥ 3

     2)  fa ≤ 2040

     3)  fb ≤ 2040

     4)  Cab > 20

where

     Cab = (fab x 450,000) / (fa x fb)
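
A minimal sketch of this test follows. It assumes the four conditions read fab ≥ 3, fa ≤ 2040, fb ≤ 2040 and Cab > 20 (the inequality signs are partly illegible in this copy), and the function names and sample frequencies are hypothetical, chosen only to illustrate the calculation.

```python
CORPUS_SIZE = 450_000   # constant appearing in the cohesion formula


def cohesion(fab, fa, fb, corpus_size=CORPUS_SIZE):
    """Cohesion value Cab = (fab x corpus size) / (fa x fb)."""
    return fab * corpus_size / (fa * fb)


def is_content_bearing(fab, fa, fb,
                       min_pair=3, max_word=2040, min_cohesion=20):
    """Apply conditions (1)-(4), under the assumed reading of the signs."""
    return (fab >= min_pair
            and fa <= max_word
            and fb <= max_word
            and cohesion(fab, fa, fb) > min_cohesion)


# Hypothetical pair of two rare words that recur together three times.
rare_pair = is_content_bearing(fab=3, fa=16, fb=32)

# Hypothetical pair whose second word is a very frequent function word;
# it fails condition (3) regardless of its cohesion value.
function_word_pair = is_content_bearing(fab=3, fa=35, fb=8879)

print(rare_pair, function_word_pair)
```

The frequency ceiling on the individual words keeps very common words out of the units, while the cohesion threshold keeps only pairs that occur together far more often than their separate frequencies would suggest.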


In order to show the exact nature of this data and as possible
sample data for those who wish to test other methodologies we
present a portion of the two-word strings derived from the GE data base.
The listing below (Figure 1) shows all recurrent pairs whose first
word begins with the letter "O". The pair itself, the frequencies of
the two words involved, and the frequency of the pair are respectively
listed. In addition, the cohesion value (Cab) of those pairs which
meet conditions (1) - (4) is given.


Word A       Word B    a      b             fab    Cab

O16          T         2      (illegible)   2
O2           H2        16     32            3      (illegible)
O2           N2        16     (illegible)   2
O2           AND       16     16622         3
OBEYING      A         4      6232          2
OBJECTIVE    IN        35     8879          3
OBJECTIVE    IS        35     1704          3

FIGURE 1

*  By  Robert  H.  Curtice.  Not  previously  Issued. 
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Table  1 


OBTAINEO 

FROM 

OBTAINED 

IF 

OBTAINEO 

IN 

OBTAINEO 

IS 

OBTAINEO 

ONLY 

OBTAINED 

ON 

OBTAINED 

OVCR 

OBTAINED 

TO 

OBTAINEO 

UNDER 

OBTAINEO 

WHEN 

OBTAINEO 

WHICH 

OBTAINED 

WITH 

OBTAINED 

THROUGH 

OBTAINED 

USING 

OBTAINING 

ADEOUATE 

OBTAINING 

DATA 

OBTAINING 

HOT 

OBTAINING 

AN 

OBTAINING 

A 

OBTAINING 

THE 

OBTAINING 

SOLUTIONS 

OBTAINING 

STEELS 

OBTAINING 

STRESS 

OBTAIN 

ANALYTICAL 

OBTAIN 

DATA 

OBTAIN 

DESIGN 

OBTAIN 

GENERAL 

OBTAIN 

INFORMATION 

OBTAIN 

INSIGHT 

OBTAIN 

MAXIMUM 

OBTAIN 

NUMERICAL 

OBTAIN 

QUALITATIVE 

OBTAIN 

AN 

OBTAIN 

A 

OBTAIN 

THE 

OBTAIN 

SATISFACTORY 

OBTAIN 

SOLUTIONS 

OBTENUS 

PAR 

OCCUPIED 

BY 

OCCURENCE 

OF 

OCCURRED 

AT 

OCCURREO 

BY 

OCCURRED 

WHEN 

OCCURRED 

KITH 

OCCURRENCE 

AND 

OCCURRENCE 

OF 

OCCURRING 

DURING 

OCCURRING 

IN 

OCCUR 

DISCUSSED 

OCCUR 

M 

OCCUR 

AND 

OCCUR 

AT 

(Continued) 

2000546100 1 80  7100000  00  C  00000000  52f£5 
200054500C1 1300000000 00 000000 02; 
2000  545008879000000000000000041*; 
2000 5450C 1704 0000000000 00000002! 
2000545000241000000000000000002: 
20005 45 00463 4P 90000 GO 00 000000271 
2  000  545(0004240000  000  0000  00000051 


2  000  545j00  80451)00  00000  00000000051 
2000545|o00596c000000000000000c5t 
2000546000 303000000000000 0000051 
2000545001 1800000000000000000021 
20005450038440000000000000000281 
20005451000402000000000000000002! 

20  00  5  451000  57  21)000000000000000101 
200 00 64 00004 3D 00 00000000000000?! 
200006400094 3POOGOOUOOOOOOOOOO  MZZ 
20000640002 36DOGOCOOOOOOOOOOOU2! 
2000064001 474PCOOOOOOOOOOOOOOC?! 
20000 64l006  2 3 2b000 0000 000 OOOOOO  )i 
2000D64f00  89l  U»00000000000000004l 
2000064000 39Cb000000000000000u?l 
2000064,000)1  lpOUOOOOOOOOOOOOOO?' 
2000 0 6400089 Op OOOCOOOOOOOOOOOC2I 
20001  06j000  l  8  110 00 000 000000000 00?  1 
2000106j000943p0000O00000000000  3« 
2000  l C6|0C  11  7300000000000000000^-1 
2000 10 60 0038 700000000 00 00000002i 
2000 1 0600022400000000000000000 
20001 06|00C  00  8)0000000  0000  0000  0021 
2000  1  06^0002  0?p0000')0  OOOOOOOOOO  2< 
200010  60  CO  1 8  9)000  000  00  0000000  GO  21 
2000 l 060000 32 OOOOOOOOCOOOOOOOoZi 
2000  1 0600  14  74D000000C  000000  000  3(i 
20C0  10600623  2  000  000  00  OOOOOoO  0081! 
200010600891 10000000COOC000000  3(t 
20001 0600008 3GOOOOOOOGOCOOO 000?' 
2O0O1O60O0390D0C00O0O0000G00OC2' 
2000002)0000  19GOOOOOOCOOOOOOOC02;« 

200000^002 75900C0 00 COOOOO 00 000^ 

20  COO  040  30  17  OOOOOOOOOOOOOOO  0  00 
2000  0  24(0027700  000000000000000  f,6|i 
2OOOU24,OO2  759GOOGO0GOO0OOO0Ou02ji 
20000  2400C 3031)00000000000000  0C21 
20000  240u3844D000000G0C  00000  OC<ii 
200001>':016622  0000000000000000n  I 
200O01C03O170QG0C0000O00000GO')‘J' 
2000011000286000000000000000002- 
20000 11 0088790000 COOGGOOCO 0001] ' 
20000 4f 00049 7DOOOOCOOOOOOCUOOC2 
2000046 00 7364000000000000000C 02 
2000046016622000000000000000002- 
2000046 002 77CCOOOOOOOOOOOGOOOC2 < 
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OCCUR 

DURING 

OCCUR 

IN 

OCCUR 

WHEN 

OCCURS 

FIRST 

OCCURS 

AND 

OCCURS 

AS 

OCCURS 

AT 

OCCURS 

FOR 

OCCURS 

IN 

OCCURS 

WHEN 

OCCURS 

WITH 

OCTENE 

1 

OCTOBER 

1952 

OCTOBER 

1 

OCT 

1954 

OCT 

1955 

OCT 

1 

OFFERED 

NO 

OFFICE 

OF 

OFF 

AIRPLANE 

OFF 

DESIGN 

OFF 

LIMITS 

OFF 

M 

OFF 

AND 

OFF 

SIZE 

OFF 

THRUST 

OFF 

VALVE 

OFFSET 

DIFFUSERS 

OFFSET 

YIELD 

OF. 005 

PCT 

OFTEN 

FOUND 

OGIVE 

CYLINOER 

OHIO 

IT 

OHIO 

STATE 

OHMS 

LAW 

OIL 

♦ 

OIL 

ASH 

OIL 

BURNING 

OIL 

COLUMN 

OIL 

DEVELOPMENT 

OIL 

FILM 

OIL 

FLOW 

OIL 

HOSE 

OIL 

M 

OIL 

OPR 

OIL 

RESULTS 

OIL 

AND 

OIL 

AT 

OIL 

OIL 

IN 

Table  1  (Continued) 


2 0000460002 8 dOOO 000 00 00 0000 00 02 1 
200004  fjOO  8879000000  00  000  000  00 13( 


200004^00030  iooocooooooooooooosHW 
200005400017' 

2000054  01 662  d( 

2000054  00 168*1  C 
2000054  002770C 
2000054  007401 
2000054008879C 


30000000000000000021 

000000000000000003' 

000000000000000002' 

0000000000000000051 

30000000000000000041 

0000000000000000121 


2000054 


2000054 
200000 
200001 
200001 
2000011100001 
200001100001 
20000 11000632 
2000001 00037 


00030 

003844 

31000632 

100001 

10C0632 


3(0  00  00000000  00000031 
0000000000000000021 
000  00  000  00  00000  00  3f,M2 

T(0000000c0000000002( 

000000000000000002' 

410000000000000000021 

4)0000000000000000021 

OOOOOGOOOOOO 00 00021 
30000000000000000021 


40 


(JO 


200000 

2000083 

200008 

200008 

200008 

2000O8 

20C008 

200008 

200008 

2000004 

200000 

200000 

200000 

200000 

20000081 


3017q0000000000000000c3< 
0000840000000000000000021 
8j00117  3K)00e00000000000010«-fcS 
8000092 OOOOOOCOOOOOOOOOC 21 
8(007364  0000000000000000021 
166230000000000000000 13( 
0020T|000000000000000002( 


80 _ 

80004H  0000000000000000054«fc| 
80000740000000000000000021 
,00003C000000000000000002( 

900013* 0000000000000000044**® 

2J000241 00000  f)000000000002t 
9100026:  0000000000000000021 
5I00022C  00000000000000000 
00026C 0000000000000000021 


2000003 

2000120 

2000120 

200012 

2000120 

200012C 

200012 

200012C 

2000120 

200012 

200012C 

200012C 

200012C 

200012C 

200012 


200012CK)0000( 


C  01 


CO] 


2000008(0002640000000000000000021 
00005*  0000000000000000021 
0G042< 0000000000000000021 
oooon  ooooooooooooooouo3*750 

0000142  000000000000000004«-VftS 
00004 : 0000000000000000021 
00089] 00C0000000000C00021 
d00015: OOOOOOOOOOOOOOOOC8*W* 
002184  0000000000000000041 
0000 1( 000 30000C00000 00021 
0736* 0000000000000000021 
000062  000000000000000002( 
OOlilC  000000000000000002( 
16622  0000000000000000081 
0 02 77 (u 00 00 00 0000 00000021 
9008879000000000000000 0031 
0000000000000000021 
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Table  1  (Continued) 


OBJECTIVE 

OF 

OBJECTIVE 

TO 

OBJECTIVE 

WAS 

OBJECTIVES 

AND 

OBJECTIVES 

OF 

OBJECT 

RETENTION 

OBJECT 

OF 

OBJECT 

TU 

OBJECT 

WAS 

OBJECTS 

PRODUCED 

OBJECTS 

OF 

OBLATENESS 

ANP 

OBLATE 

SPHEROIDAL 

OBLIQUE 

COORDINATES 

OBLIQUE 

SHOCK 

OBRABUTKA 

METALLOV 

OBSCURED 

BY 

OBSERVATION 

OF 

OBSERV AT  IONS MADE 

OBSERVATIONS 

AND 

OBSERVATIONS 

OF 

OBSERVATIONS 

ON 

2000035(3 
2000035 
20000200 
2000020  3 
20000473 
2000C47  3 
20000473 
2000047  3 
2000008  3 
20000083 
2000004) 
2000003  3 
2000020  3 
2000020) 
2000003  3 
2000002) 
2000024) 
2000069) 
2000069) 
2000U69  3 


20000  35(330  170!0000000000000000 11'; 

08045)00000000  0000000003 
300398000000000000000005  IM 
1 662 2 0 00 000 U 000 0000 0004 
30 1 7 0000C 0000000000 00 10 
C0006000000000000000002 
30 170000000000000 00 0029 
08 04 500C000 000000000005 
00398000000U00000000002 
00191)00000000000000002 
30170000000000000000002 
166220000000000000000C? 
000020  00  OOuOOOOOO  00  00021 
0004 700000 J000000000002 
0059 70000000000000000 12^®« 
OUOllOCOOOoOOOOOOOOOOO  340)61 
02  75  90000000000000000021 
30170000000000000000015: 

00  51 80 00000 OoOOO 0000003  ST 
1662200000C000000000002I 
30170000000000000000013! 
2000C  69004  6  3  4*3000000000000000261 
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M 
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BY 
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2000 
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:2 759bc00000oC0000000C2 
101 1)0  02  86(000  00  0000000000002 
101DC  7406)0000000  00000000004 
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OBTAINABLE 

OBTAINED 
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THAT 

TU 

UNDER 

WITH 

STRENGTH 
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NOISE 

AF33 

DIRECTLY 

EFFECTS 


20O0101U0C  ) 
20001 0 lbp  304 
200010100059 
2000  1C  1003844 
2000  1C lpCC68 
2000101)00024 
2000C0600U20 
200054503009 
200054500004 
2000545.000831 


54000000000000000002’ 
500  0  0  000  O  00  0000  0002! 
6)0000000000000  0000  7I 

000000000D00O0O004! 
90000000000000000? 
0000000000000000003 
50000000000000000C2 
4000000L0000000000? 

800000000  00  000000c  ?l 
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55 


OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 

OBTAINED 


EXPERIMENTAL 

M 

PREVIOUSLY 
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AS 

AT 
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NASA  VOCABULARY  TWO -WORD  STRINGS,  THEIR  USAGE,  AND  RELATION  TO 
SYSTEM  CBU'S  IN  THE  GE-2  AUTO-INDEXED  MESSAGE  COLLECTION  * 


A.  Introduction 

This  section  reports  a  study  of  the  two-word  strings  in  the  NASA 
indexing  vocabulary  and  their  relationship  to  the  GE-2  prototype 
retrieval  system.  The  principal  motivation  and  the  discussion  of  the 
results  of  the  data  gathered  in  this  study  appear  in  Chapter  VI  of 
the  Evaluation  Study  Report.  We  wish  to  note  that  V.  E.  Giuliano  was  a 
major  participant  in  conducting  the  study  whose  details  are  recorded  here. 
Briefly,  the  two-word  strings  in  the  NASA  term  list  are  regarded,  in 
the  report,  as  representatives  of  Subject  Heading  queries  that  might 
be posed to the GE-2 retrieval system. Their overlap properties with
the GE-2A machine-indexing vocabulary are thus of considerable interest.
Similarly,  it  is  of  importance  to  determine  how  many  of  these  two-word 
NASA  terms  turn  out  to  be  System  CBU's  in  the  GE-2  system.  The  data 
for  determining  these  relationships  are  presented  here  together  with 
other  observations  about  the  set  of  two-word  terms.  Because  of  their 
importance  to  studies  now  under  way,  Appendix  A  contains  an  exhaustive 
listing  of  the  pairs  that  were  studied,  together  with  the  observations 
made  on  each. 

B.  NASA  Postings  to  Two-Word  Terms 

The first step was to estimate the total number of postings of two-word
strings in the total NASA collection. Consequently, the NASA
frequency-ordered indexing vocabulary list was divided rather arbitrarily
into ten intervals. Samples were taken from each of the intervals and were
used to estimate the average number of postings to the terms in each
frequency interval. From Figure 3 of Technical Note CACL-29, the
average number of two-word strings per page could be calculated (making
appropriate allowance for the fact that not all MWT's are two-word
strings).

*  Issued on June 15, 1966 to a limited distribution as Technical Note
CACL-31 by Peter R. Bono and Paul E. Jones, Jr. References only
have been updated.


Table A records all the information necessary for the estimation of the
total number of postings to two-word strings in each interval. The
number of postings is calculated by taking the product of the average
number of two-word strings per page, the number of pages in the
interval, and the average term frequency over the whole interval. For
example, for sample 2, we multiply approximately 39 x 6 x 150 and get
35,200 -- the estimated number of postings in the interval of pages
13-18. From Table A, we see that the total estimated number of postings
to two-word strings in the NASA collection is 213,505 tokens.
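
The interval estimate can be sketched as below, using the rounded factors shown in Table A for samples 2 through 5. The function name and the tuple layout are assumptions of this sketch. Because the tabulated factors are rounded, the products come out close to, but not exactly equal to, the published estimates (for sample 2 the product is 35,100 against the tabulated 35,200).

```python
# (sample number, pages in interval, two-word terms per page,
#  average term frequency, published estimate) as read from Table A.
TABLE_A_ROWS = [
    (2, 6, 39, 150, 35200),
    (3, 6, 47, 93, 26200),
    (4, 12, 54, 57, 37000),
    (5, 12, 56, 34, 22800),
]


def estimated_postings(pages, terms_per_page, avg_frequency):
    """Postings estimate = pages x terms per page x average frequency."""
    return pages * terms_per_page * avg_frequency


products = {}
for sample, pages, per_page, freq, published in TABLE_A_ROWS:
    products[sample] = estimated_postings(pages, per_page, freq)
    print("sample %d: product %d, published %d"
          % (sample, products[sample], published))
```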


C.  Relating  NASA  Subject  Headings  to  GE  Collection  Parameters 

The next step was to classify the two-word strings into six groups we
wished to differentiate. In order to describe these groups, it is
helpful to use the following notation. Represent a word pair as ab
where "a" and "b" respectively represent the first and second words of
the pair. Let f_ab denote the frequency in the GE-2 corpus of the
two-word string ab, and let C_ab denote the coherence measure of ab.
Then the six groups of two-word strings can be described with precision:
TABLE A

NUMBER OF POSTINGS TO MULTIPLE-WORD TERMS

Sample   Interval in NASA   Number of   Average Number   Average Term     Estimated Number   Number
Number   Vocabulary Book    Pages in    of Two-Word      Frequency Over   of Postings        in
         (by page number)   Interval    Terms Per Page   Whole Interval   in Interval        Sample

   1     p. 1-p. 12            12                                         49,705*              122
                                                                          (actual count)
   2     p. 13-p. 18            6           ~39              ~150         35,200                36
   3     p. 19-p. 24            6           ~47              ~93          26,200                60
   4     p. 25-p. 36           12           ~54              ~57          37,000                72
   5     p. 37-p. 48           12           ~56              ~34          22,800                48
   6     p. 49-p. 72           24           ~59              ~16          22,400                24
   7     p. 73-p. 96           24           ~59              ~8           11,200                12
   8     p. 97-p. 120          24           ~59              ~4            5,600                12
   9     p. 121-p. 144         24           ~59              ~1.5          2,100                12
  10     p. 145-p. 169         25           ~52              ~1            1,300                12

Total Estimated Number of Postings = 213,505

*NOTE:  The number of postings for the 122 two-word terms in the first twelve pages
        of the NASA index term dictionary was actually counted; consequently, the
        two statistics noted above are unnecessary.
Group    Definition

G        Word "a" in GE-2A Vocabulary List;
         word "b" in GE-2A Vocabulary List;
         f_ab ≥ 2; C_ab > 20

N/A      Word "a" not in GE-2A Vocabulary List;
         word "b" in GE-2A Vocabulary List

A/N      Word "a" in GE-2A Vocabulary List;
         word "b" not in GE-2A Vocabulary List

N/N      Word "a" not in GE-2A Vocabulary List;
         word "b" not in GE-2A Vocabulary List

C        Word "a" in GE-2A Vocabulary List;
         word "b" in GE-2A Vocabulary List;
         f_ab ≥ 2, but C_ab < 20

F        Word "a" in GE-2A Vocabulary List;
         word "b" in GE-2A Vocabulary List;
         but f_ab ≤ 1


This classification of the NASA two-word strings permits us to explore
the relationship between NASA subject headings and the GE-2 System
CBU's. The number of postings NASA had given each string was also of
interest in developing this relationship. Accordingly, we proceeded to
estimate the total NASA usages (postings) for each of the six kinds of
strings. Using the same data as used in the preparation of Table A,
we calculated this figure by taking the number of two-word strings of
each kind in the sample for an interval, dividing by the sample size
for the interval, and multiplying by the estimated total number of
postings to two-word strings in that interval. For example, in sample
2 (pp. 13-18), 10 of the 36 sample strings belong to group G. The
total estimated number of postings for this interval was 35,200
(cf. row 2, Table A). Consequently, the total estimated number of
postings for group G for this interval is

     10 x (1/36) x 35,200 = 9,950

Table B-I displays these data for each of the six groups and for each
of the ten intervals. Also, the ratio, P_k, of total number in Group X
to the total size of sample k is recorded (where X = G, N/A, A/N, N/N,
C, or F; and k = 1, ..., 10). In the above example, this would be
P_2 = 10/36 = .28.


D.  Measures  Reflecting  the  NASA/GE  Relationship 

We  also  wished  to  have  an  estimate  of  the  number  of  types  (as  opposed 
to  tokens)  for  each  group  of  two-word  strings.  The  total  number  of 
types  for  each  page  interval  was  estimated  using  Figure  3  of  TN  CACL-29 
(again  multiplying  the  total  number  of  MWT's  by  an  appropriate  scaling 
factor  to  allow  for  the  fact  that  some  of  the  MWT's  are  composed  of 
more  than  two  words). 

The estimates of the number of types for each group were calculated
simply by multiplying the total estimated number of types by the ratio
P_k (from Table B-I). These figures are displayed in Table B-II. For
example, in sample 2, there are approximately 234 two-word strings.
Since 10 of the 36 strings sampled belonged to Group G, we would
expect that (10/36) x 234 = P_2 x 234 = 65 of the 234 would belong to
Group G.
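
The two scaling steps described in Sections C and D, for postings and for types, can be sketched as follows for the sample 2 example. The function names are assumptions of this sketch. The direct product for postings comes to about 9,778, whereas the report quotes 9,950 in the text and 9,900 in Table B-I, so the published figures presumably reflect rounding or intermediate values that are not shown.

```python
def group_ratio(number_in_group, sample_size):
    """Fraction of the sampled strings that fall in a given group."""
    return number_in_group / sample_size


def scale_to_interval(number_in_group, sample_size, interval_total):
    """Scale a sample proportion up to an interval total (postings or types)."""
    return group_ratio(number_in_group, sample_size) * interval_total


# Sample 2 (pp. 13-18): 10 of 36 sampled strings belong to group G.
ratio_g = group_ratio(10, 36)
postings_g = scale_to_interval(10, 36, 35200)   # interval postings estimate
types_g = scale_to_interval(10, 36, 234)        # interval types estimate

print("ratio %.2f, postings %.0f, types %.0f"
      % (ratio_g, postings_g, types_g))
```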
TABLE B-I

Sample   Total Est.   Total                                     G        N/A      A/N      N/N      C        F
Number   Number of    Sample
         Postings     Size

   1     49,705       122      No. in Sample                    67       27       7        4        8        9
         (actual)              Actual No. of Postings           33,691   8,423    2,102    1,303    2,206    1,980
                               P_1 = Act. No. of Postings /     .68      .12      .06      .03      .06      .05
                                     Total No. of Postings

   2     35,200       36       No. in Sample                    10       9        6        1        7        3
                               Est. No. of Postings             9,900    8,800    6,000    1,000    6,700    2,800
                               P_2 = No. in Sample /            .28      .25      .17      .03      .19      .08
                                     Total Sample Size

   3     26,200       60       No. in Sample                    22       6        11       4        9        8
                               Est. No. of Postings             9,700    2,620    4,700    1,830    3,950    3,400
                               P_3 = No. in Sample /            .37      .10      .18      .07      .15      .13
                                     Total Sample Size

   4     37,000       72       No. in Sample                    19       18       9        10       8        8
                               Est. No. of Postings             10,000   9,250    4,450    5,300    4,000    4,000
                               P_4 = No. in Sample /            .27      .25      .12      .14      .11      .11
                                     Total Sample Size

   5     22,800       48       No. in Sample                    9        15       9        4        4        7
                               Est. No. of Postings             4,300    7,100    4,300    1,900    1,900    3,300
                               P_5 = No. in Sample /            .19      .31      .19      .083     .083     .144
                                     Total Sample Size

Section IV: CACL-31
TABLE B-I (continued)

Sample   Total Est.   Total                                     G        N/A      A/N      N/N      C        F
Number   Number of    Sample
         Postings     Size

   6     22,400       24       No. in Sample                    3        7        3        2        2        7
                               Est. No. of Postings             2,800    6,500    2,800    1,900    1,900    6,500
                               P_6 = No. in Sample /            .125     .29      .125     .083     .083     .29
                                     Total Sample Size

   7     11,200       12       No. in Sample                    1        6        1        1        0        3
                               Est. No. of Postings             933      5,600    933      933      0        2,800
                               P_7 = No. in Sample /            .083     .50      .083     .083     .00      .25
                                     Total Sample Size

   8     5,600        12       No. in Sample                    0        6        3        1        0        2
                               Est. No. of Postings             0        2,800    1,400    470      0        930
                               P_8 = No. in Sample /            .00      .50      .25      .083     .00      .167
                                     Total Sample Size

   9     2,100        12       No. in Sample                    0        6        3        3        0        0
                               Est. No. of Postings             0        1,050    525      525      0        0
                               P_9 = No. in Sample /            .00      .50      .25      .25      .00      .00
                                     Total Sample Size

  10     1,300        12       No. in Sample                    0        8        1        1        0        2
                               Est. No. of Postings             0        864      109      109      0        218
                               P_10 = No. in Sample /           .00      .667     .083     .083     .00      .167
                                      Total Sample Size

         213,505      400      Total of Est. Postings           71,324   53,007   27,319   15,270   20,656   25,928
TABLE B-II

Sample   Total Est.   Est. Total                          G      N/A     A/N     N/N    C      F
Number   Number of    Number of
         Postings     Types

   1     49,705       122          Actual No. of Types    67     27      7       4      8      9
         (actual)     (actual)
   2     35,200       234          Est. No. of Types      65     59      39      7      46     18
   3     26,200       282          Est. No. of Types      104    28      51      20     42     37
   4     37,000       658          Est. No. of Types      178    165     79      92     72     72
   5     22,800       660          Est. No. of Types      125    203     125     56     56     95
   6     22,400       1,400        Est. No. of Types      175    408     175     117    117    408
   7     11,200       1,400        Est. No. of Types      116    700     117     117    0      350
   8     5,600        1,400        Est. No. of Types      0      700     350     116    0      234
   9     2,100        1,400        Est. No. of Types      0      700     350     350    0      0
  10     1,300        1,250        Est. No. of Types      0      833     104     104    0      209

         213,505      8,806        Column Totals          830    3,823   1,397   983    341    1,432


There  are  a  number  of  significant  measures  which  can  be  calculated  from 
data  in  Tables  B-I  and  B-II.  These  measures  --  four  in  all  --  are 
defined  as  follows: 


α = (G + C + F) / N

        Of the total number of two-word strings, the percentage of
        strings ab such that both "a" and "b" are members of the
        GE-2A Machine Indexing Vocabulary.


β = (G + C) / (G + C + F)

        Of the total number of two-word strings ab such that "a" and
        "b" are members of the GE-2A vocabulary, the percentage of
        strings such that f_ab ≥ 2.


γ = G / (G + C + F)

        Of the total number of two-word strings ab such that "a" and
        "b" are members of the GE-2A vocabulary, the percentage of
        strings such that f_ab ≥ 2 and C_ab > 20.


δ = (G + C + F + A/N + N/A) / N

        Of the total number of two-word strings, the percentage of
        strings such that at least one of the words is in GE-2A.


Note:  N = total number of tokens (or types).
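The four measures can be checked directly against the column totals of Table B-II. The following is a minimal sketch, not the original program: the function name `vocabulary_measures` and its argument names are hypothetical, and it assumes that N is the sum of the six group totals (which holds for the type totals, 8,806).

```python
def vocabulary_measures(g, n_a, a_n, n_n, c, f):
    """Return the alpha, beta, gamma and delta measures from the six group totals.

    Assumption: N (total tokens or types) equals the sum of the six groups.
    """
    n = g + n_a + a_n + n_n + c + f
    both_in_vocab = g + c + f  # strings whose two words are both in GE-2A
    return {
        "alpha": both_in_vocab / n,
        "beta": (g + c) / both_in_vocab,
        "gamma": g / both_in_vocab,
        "delta": (both_in_vocab + a_n + n_a) / n,
    }


# Column totals for types, taken from Table B-II.
type_measures = vocabulary_measures(g=830, n_a=3823, a_n=1397, n_n=983, c=341, f=1432)
for name, value in type_measures.items():
    print(f"{name} = {value:.3f}")
```

Passing the token totals from Table B-I in place of the type totals should give the token measures in the same way, up to rounding in the estimated postings.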


These  measures  have  been  calculated  for  both  tokens  and  types  and  are 
displayed  in  Table  B-III. 


Finally, the percentages (P_k) given in Table B-I were plotted versus
the term usage frequency in the NASA collection (equivalent to the page
number in the printout of the frequency-order NASA 18K index term
dictionary).

The graph was produced by a smoothing procedure which took an average
of the P_k's over three consecutive values. That is, the new value is

        P'_k = (P_{k-1} + P_k + P_{k+1}) / 3        for k = 1, ..., 9.

The final point P_10 is smoothed by taking

        P'_10 = (P'_9 + P_10) / 2.

These smoothing calculations are recorded in Table C-I.
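The smoothing rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the original procedure: the function name `smooth_percentages` is hypothetical, and the first point is left as observed because Table C-I lists identical observed and smoothed values at k = 1.

```python
def smooth_percentages(p):
    """Three-point smoothing of observed percentages p[0], ..., p[n-1].

    Interior points use the mean of three consecutive observed values;
    the final point uses the mean of the previous smoothed value and the
    last observed value. Assumption: the first point is left unsmoothed.
    """
    n = len(p)
    smoothed = [0.0] * n
    smoothed[0] = p[0]
    for k in range(1, n - 1):
        smoothed[k] = (p[k - 1] + p[k] + p[k + 1]) / 3
    smoothed[n - 1] = (smoothed[n - 2] + p[n - 1]) / 2
    return smoothed


# Observed N/A percentages for samples 1 to 10, as listed in Table C-I.
n_a_observed = [.12, .25, .10, .25, .31, .29, .50, .50, .50, .67]
n_a_smoothed = smooth_percentages(n_a_observed)
print([round(x, 2) for x in n_a_smoothed])
```

For the N/A column this sketch agrees with the smoothed values tabulated in Table C-I to within about .01.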


From the resulting graph (Table C-II), it can be seen that for two-word
strings that have a high frequency of usage, the probability is quite
high that both words of the string will belong to the GE-2A Vocabulary
and that f_ab ≥ 2 and C_ab > 20 (i.e., the string belongs to group G).
However, as we consider strings with lower and lower NASA frequency,
this probability falls practically to zero. That is, group G draws
most of its members from the high-frequency two-word strings. The
graph also shows that, except for initial disturbances among the
high-frequency strings, groups A/N and N/N draw equally from the whole
NASA indexing vocabulary. That is, these two groups appear to be largely
independent of NASA frequency of usage.
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TABLE  B-III 

α-, β-, γ-, AND δ-TYPE MEASURES CALCULATED FOR
THE NASA INDEX VOCABULARY FOR BOTH TOKENS AND TYPES


            TOKENS                                 TYPES

α = 117,908 / 213,505 ≈ .552           α = 2,603 / 8,806 ≈ .296

β =  91,980 / 117,908 ≈ .780           β = 1,171 / 2,603 ≈ .450

γ =  71,324 / 117,908 ≈ .605           γ =   830 / 2,603 ≈ .319

δ = 198,235 / 213,505 ≈ .933           δ = 7,823 / 8,806 ≈ .888

TABLE  C-I 

SMOOTHED  DATA  FOR  THE  SIX  CURVES  PRESENTED  IN  TABLE  C-II 


             G             N/A            A/N            N/N             C              F
  k      P_k   P'_k     P_k   P'_k     P_k   P'_k     P_k   P'_k     P_k   P'_k     P_k   P'_k

  1      .68   .68      .12   .12      .06   .06      .03   .03      .06   .06      .05   .05
  2      .28   .44      .25   .16      .17   .14      .03   .04      .19   .13      .08   .09
  3      .37   .31      .10   .20      .18   .16      .07   .08      .15   .15      .13   .11
  4      .27   .24      .25   .23      .12   .16      .14   .10      .11   .11      .11   .13
  5      .19   .20      .31   .28      .19   .14      .083  .10      .083  .09      .14   .18
  6      .13   .13      .29   .37      .12   .13      .083  .08      .083  .06      .29   .23
  7      .083  .07      .50   .43      .083  .15      .083  .08      .00   .03      .25   .24
  8      .00   .03      .50   .50      .25   .19      .083  .14      .00   .00      .17   .14
  9      .00   .00      .50   .56      .25   .19      .25   .14      .00   .00      .00   .11
 10      .00   .00      .67   .61      .083  .14      .083  .11      .00   .00      .17   .14

P'_k = (P_{k-1} + P_k + P_{k+1}) / 3 for k = 1, ..., 9;  P'_10 = (P'_9 + P_10) / 2 for k = 10,
where P_k = observed percentage at "time" k.
TABLE  C-II

[Graph: % of two-word subject headings in the page (or frequency)
interval indicated, by type, plotted against page number in the printout
(term usage frequency in the NASA collection).]

APPENDIX  A 


SAMPLE  1 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

2264 

heat  transfer 

630 

G 

1547 

791 

231 

2218 

high  temperature 

G 

57 

1462 

rocket  engine 

G 

100 

1328 

space  flight 

20 

G 

450 

374 

54 

967 

space  vehicle 

71 

G 

450 

259 

280 

932 

solid  propellant 

105 

G 

431 

384 

290 

877 

high  energy 

G 

34 

843 

radiation  effect 

G 

24 

778 

magnetic  field 

G 

464 

764 

cross  section 

G 

1047 

706 

mach  number 

G 

524 

679 

wave  propagation 

23 

G 

371 

174 

160 

600 

computer  program 

G 

232 

586 

high  speed 

G 

88 

586 

high  frequency 

G 

33 

564 

cosmic  radiation 

N/A 

626 

558 

reynolds  number 

G 

541 

549 

space  environment 

11 

G 

450 

115 

96 

544 

manned  spacecraft 

N/N 

504 

low  temperature 

G 

54 

504 

upper  atmosphere 

N/A 

489 

differential  equation 

G 

328 

475 

high  altitude 

G 

38 

467 

carbon  dioxide 

A/N 

1248 

450 

launch  vehicle 

N/A 

1008 

447 

reentry  vehicle 

N/A 

440 

hypersonic  flow 

G 

59 

436 

gas  flow 

G 

22 

427 

data  processing 

G 

38 

426 

supersonic  flow 

119 

G 

517 

2402 

43 

422 

aluminum  alloy 

G 

133 

419 

electronic  equipment 

G 

243 

418 

VTOL  aircraft 

32 

G 

94 

636 

230 

416 

apollo  project 

N/A 

415 

doppler  effect 

N/A 

410 

automatic  control 

G 

131 

406 

aerodynamic  characteristics 

G 

81 

404 

refractory  metal 

G 

278 

398 

geomagnetic  field 

N/A 

398 

wind  tunnel 

274 

G 

303 

407 

1000 

396 

electromagnetic  wave 

G 

138 

391 

low  frequency 

G 

73 

382 

gamma  radiation 

N/A 


SAMPLE  1  (Continued) 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

380 

turbulent  flow 

56 

G 

316 

2402 

33 

377 

thin  film 

9 

G 

221 

226 

81 

369 

rocket  nozzle 

G 

110 

368 

boundary  layer 

G 

709 

367 

flight  test 

G 

43 

367 

optimal  control 

N/A 

366 

NASA  program 

N/A 

359 

supersonic  transport 

19 

G 

517 

172 

96 

350 

aircraft  design 

14 

C 

636 

1173 

350 

high  pressure 

72 

C 

1757 

1563 

349 

blunt  body 

G 

436 

347 

aerospace  medicine 

N/N 

346 

mechanical  property 

G 

264 

338 

solar  flare 

A/N 

337 

single  crystal 

43 

G 

24 

105 

7670 

331 

electron  beam 

G 

554 

331 

material  testing 

3 

C 

102 

432 

327 

communications  satellite 

N/A 

318 

plastic  deformation 

G 

351 

317 

materials  science 

A/N 

312 

shock  tunnel 

16 

G 

597 

336 

36 

311 

liquid  propellants 

G 

100 

308 

charged  particle 

N/A 

307 

high  strength 

G 

54 

303 

satellite  observation 

F 

303 

satellite  orbit 

3 

G 

77 

21 

835 

300 

attitude  control 

N/A 

300 

laser  output 

N/A 

298 

low  pressure 

G 

23 

297 

digital  computer 

G 

1720 

293 

gas  dynamics 

G 

97 

293 

titanium  alloy 

146 

G 

599 

1691 

65 

289 

fluid  mechanics 

A/N 

286 

electron  density 

G 

76 

283 

solar  radiation 

8 

G 

88 

303 

133 

283 

space  science 

A/N 

282 

temperature  effect 

12 

C 

2033 

1866 

1.4 

280 

cylindrical  shell 

G 

413 

278 

human  performance 

N/A 

278 

physiological  response

N/A 

278 

radiation  measurement 

F 

276 

temperature  measurement 

32 

C 

2033 

704 

10 

275 

mercury  project 

N/A 

275 

spacecraft  propulsion 

N/A 

274 

infrared  radiation 

N/A 

274  reliability  engineering  F 
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SAMPLE  1  (Continued) 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

272 

motion  equation 

F 

271 

communication  system 

N/A 

270 

energy  conversion 

G 

167 

218 

temperature  distribution 

93 

G 

2033 

550 

38 

217 

crack  propagation 

G 

984 

215 

E  layer 

F 

215 

planetary  atmosphere 

N/A 

213 

boundary  value 

G 

113 

213 

flight  control 

2 

C 

374 

37 

213 

liquid  metal 

G 

64 

213 

shell  theory 

3 

C 

132 

808 

13 

212 

chemical  kinetics 

G 

315 

211 

Maxwell  equation 

N/A 

210 

radioactive  isotope 

N/N 

210 

satellite  measurement 

F 

209 

computer  method 

F 

209 

linear  system 

G 

23 

209 

thermal  stress 

55 

G 

633 

1227 

32 

208 

tensile  strength 

68 

G 

359 

761 

112 

207 

control  device 

G 

207 

satellite  communication 

A/N 

206 

radar  system 

N/A 

203 

aerodynamic  heating 

G 

290 

202 

human  tolerance 

N/N 

202 

ultraviolet  radiation 

N/A 

201 

nuclear  explosion 

A/N 

201 

transport  aircraft 

13 

G 

151 

636 

59 

200 

niobium  alloy 

3 

C 

75 

1098 

200 

orbit  calculation 

F 

199 

atmospheric  composition 

F 

199 

environmental  testing 

N/A 

198 

meteorological  satellite 

N/A 

197 

gravitational  field

N/A 
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SAMPLE  2 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

196 

elastic  deformation

G 

20 

195 

high  power 

C 

194 

absorption  spectrum 

A/N 

454 

194 

glass  fiber 

G 

866 

194 

nozzle  flow 

C 

194 

power  generator 

C 

178 

gas  discharge 

G 

177 

ionizing  radiation 

N/A 

626 

177 

low  density 

G 

124 

177 

radio  wave 

N/A 

177 

satellite  tracking 

A/N 

176 

test  facility 

G 

161 

human  body 

N/A 

160 

acceleration  stress 

F 

160 

nuclear  propulsion 

G 

81 

160 

pressure  measurement 

C 

160 

time  dependency 

A/N 

159 

radiation  shielding 

A/N 

146 

polymer  chemistry 

A/N 

145 

dynamic  stability 

G 

34 

145 

integral  equation 

G 

199 

145 

Newton  Theory 

N/A 

145 

STOL  aircraft 

N/A 

144 

electric  discharge 

N/A 

134 

rare  earth 

N/A 

3985 

133 

gas  mixture 

G 

63 

133 

radiation  intensity 

F 

133 

shell  stability 

C 

132 

analog  computer 

G 

1085 

132 

compressible  flow 

C 

124 

aircraft  performance 

C 

124 

Lagrange  equation 

N/A 

343 

124 

mass  spectrometry 

A/N 

124 

molecular  structure 

F 

124 

transonic  speed 

G 

124 

celestial  mechanics 

N/N 



SAMPLE  3 


NASA  f  Term  Name 

115  heat  exchanger 

115  Liapunov  function 

115  temperature  control 

115  transport  property 

114  atmospheric  density 

107  Defender  project 

107  magnesium  oxide 

107  matrix  analysis 

107  optical  pumping 

107  phase  shift 

100  radiation  field 

100  solar  system 

99  human  engineering 

99  ionospheric  sounding 

99  metal  surface 

94  structural  beam 

93  atmospheric  temperature 

93  axisymmetric  flow 

93  checkout  equipment 

93  chromium  alloy 

89  orbital  element 

89  pattern  recognition 

89  satellite  control 

89  structural  engineering 

88  atmospheric  ionization 

83  mercury  capsule 

83  solar  spectrum 

83  solar  eclipse 

83  wave  interaction 

82  ground  station 

112  wave  diffraction 

111  flow  characteristics 

111  ionospheric  storm 

111  jet  engine 

111  nimbus  satellite 

104  static  stability 

104  vortex  flow 

103  arc  jet 

103  approximation  method 

103  heat  flux 

97  molecular  beam 

97  optical  property 

97  propellant  combustion 

97  radiation  transfer 

97  spherical  shell 


ab 


10 

29 


3 

3 


2 

6 


21 


CBU 

G 

N/A 

C 

G 

F 

N/A 

G 

F 

A/N 

A/N 

G 

F 

A/N 

N/N 

G 

F 

F 

C 

N/A 

G 

N/A 

A/N 

F 

F 

A/N 

N/N 

A/N 

A/N 

G 

A/N 

G 

C 

N/N 

G 

N/A 

G 

C 

G 

G 

G 

G 

F 

C 

C 

G 


2033 

151 


703 

1350 


371 

371 


119 

61 


23 

106 


414 

2184 


ab 

250 

3.5 

64 

164 

54 

51 


26 


88 


31 

60 

101 

95 

12 

46 

15 

254 

243 


255  421 
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SAMPLE  3  (Continued) 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

92 

thrust  chamber 

12 

G 

418 

266 

46 

91 

helicopter  rotor 

G 

1236 

91 

high  performance 

C 

91 

ion  source

G 

152 

91 

jet  aircraft 

C 

86 

oxidation  resistance 

G 

125 

86 

power  plant 

G 

565 

86 

space  biology 

A/N 

86 

test  method 

19 

C 

775 

2040 

86 

transmission  line 

4 

G 

61 

157 

188 

81 

galactic  radiation 

N/A 

81 

Lorentz  transformation 

N/A 

1397 

81 

lunar  spacecraft 

N/N 

81 

radiation  medicine 

A/N 

81 

solar  proton 

A/N 

SAMPLE  4 

NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

79 

velocity  profile 

38 

G 

541 

161 

195 

79 

velocity  measurement 

4 

C 

541 

437 

7 

79 

viscous  fluid 

14 

G 

138 

455 

100 

74 

biological  cell 

N/A 

74 

catalytic  activity 

N/N 

74 

elastic  shell 

G 

163 

70 

power  source 

G 

35 

70 

probability  distribution 

N/A 

70 

transient  response

20 

G 

157 

202 

284 

67 

satellite  perturbation 

F 

67 

steady  flow 

34 

G 

229 

2402 

28 

67 

surface  reaction 

5 

G 

641 

113 

31 

63 

orbital  launch 

N/N 

63 

orbital  motion 

N/A 

63 

plastic  flow 

C 

60 

static  testing 

2 

G 

23 

432 

91 

60 

vibration  effect 

F 

60 

wave  attenuation 

A/N 

57 

Michigan  project 

N/A 

57 

military  technology 

N/N 

57 

plasma  arc 

G 

46 
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SAMPLE  4  (Continued) 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

54 

aircraft  noise 

G 

48 

54 

beryllium  hydride 

A/N 

54 

electrochemical  cell 

N/A 

52 

hypervelocity  projectile 

N/N 

52 

indium  antimonide 

N/N 

52 

information  retrieval

A/N 

50 

radioactive  material 

N/A 

50 

random  vibration 

G 

273 

50 

remote  control 

N/A 

48 

state  equation 

F 

48 

structural  reliability 

F 

48 

thermal  convection 

F 

46 

reactor  safety 

G 

280 

46 

reentry  condition 

N/A 

46 

reinforcing  fiber 

N/A 

76 

cold  working 

G 

419 

76 

corrosion  prevention 

A/N 

76 

Euler  equation 

N/A 

914 

72 

microwave  radiation 

N/A 

72 

neutron  scattering 

N/N 

72 

phased  array 

N/N 

68 

electromagnetic  interaction 

F 

68 

flow  pattern 

C 

68 

gaseous  laser 

A/N 

65 

optical  method 

C 

65 

radar  measurement 

N/A 

65 

scatter  propagation 

N/A 

62 

thermal  expansion 

25 

G 

633 

145 

123 

61 

cardiovascular  system 

N/A 

61 

coriolis  effect 

N/A 

58 

atmospheric  moisture 

A/N 

58 

bending  moment 

G 

518 

58 

combustion  product 

G 

146 

56 

missile  control 

C 

56 

plasma  confinement 

A/N 

56 

pressure  transducer 

A/N 

69 

53 

Debye  temperature 

N/A 

53 

earth  crust 

A/N 

53 

ion  density 

C 

51 

radioactive  fallout 

N/N 

51 

radiation  spectrum 

A/N 

51 

spacecraft  stability 

N/A 

49 

plasma  engine 

C 

49 

simulated  altitude 

6 

G 

57 

178 

256 

49 

synoptic  meteorology 

N/N 

47 

rotor  aerodynamics 

F 

47 

rotating  fluid 

C 

47 

signal  distortion 

N/N 

41 

control  system 

G 

54 

41 

delay  line 

N/A 

41 

dynamic  model 

F 
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SAMPLE  5 


NASA f    Term Name    f_ab    CBU    f_a    f_b    C_ab

44 

proton  energy 

N/A 


44 

seasonal  variation 

N/A 

42 

elliptical  orbit 

N/A 

42 

energy  exchange 

A/N 

40 

elastic  bending 

F 

40 

elementary  particle 

N/A 

38 

aerospace  system 

N/A 

38 

air  inlet 

G 

85 

37 

phase  transformation 

G 

67 

37 

piezoelectric  crystal 

N/A 

36 

Voyager  project 

N/A 

35 

aircraft  production 

F 

34 

lift  fan 

G 

204 

34 

meteor  shower 

N/N 

33 

Riemann  integral 

N/A 

33 

scientific  satellite 

N/A 

32 

trace  contaminant 

N/N 

32 

transpiration  cooling 

N/A 

30 

conducting  media 

F 

30 

cyclotron  radiation 

N/A 

29 

data  correlation 

F 

29 

dynamic  pressure 

G 

31 

28 

hydrogen  fluoride 

A/N 

28 

hypersonic  nozzle 

C 

43 

notch  strength 

G 

21 

43 

parabolic  equation 

N/A 

41 

control  system 

G 

54 

41 

delay  line 

N/A 

39 

atmospheric  electricity 

A/N 

39 

beryllium  compound 

C 

38 

storage  battery 

A/N 

38 

submillimeter  wave 

N/A 

36 

fluorescent  emission 

N/N 

36 

flight  training 

A/N 

35 

particle  emission 

A/N 

35 

periodic  oscillation 

N/A 

33 

air  cooling 

C 

33 

boron  nitride 

G 

1855 

32 

high  gain 

A/N 

32 

hydraulic  equipment 

F 

31 

line  spectrum 

A/N 

31 

linear  accelerator 

A/N 

30 

probability  density 

G 

200 

30 

radio  transmitter 

N/N 

29 

stress  wave

2 

C 

890 

371 

29 

surface  energy 

F 

28 

thermal  shock 

35 

G 

633 

597 

28 

titanium  oxide 

F 
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SAMPLE  6 


NASA f    Term Name

27 

ground  test 

26 

earth  motion 

25 

broad  band  amplifier 

25 

thrust  measurement 

24 

pressure  oscillation 

23 

logic  network 

22 

Cosmos  satellite 

22 

sensory  deprivation 

21 

ground  control 

21 

toroidal  shell 

20 

nickel  compound 

19 

electric  potential 

19 

shear  strength 

18 

lateral  control 

17 

anisotropic  shell 

17 

organic  coolant 

16 

compression  buckling 

16 

nuclear  effect 

16 

xenon  light 

15 

lithium  alloy 

15 

vacuum  melting 

14 

gravity  center 

14 

solar  observer 

13 

electron  recombination 

SAMPLE  7 


ab 


CBU 


f 

a 


C 

F 

N/A 

3  C  418 
C 

N/N 

N/A 

N/N 

F

N/A 

F 

F 

12  G  201 

N/A 
N/A 
A/N 
F 
F 

N/A 

F 

4  G  176 

N/A 

A/N 

A/N 


437  714 

34 


689  39 


131  78 


NASA  f  Term  Name 


ab 


CBU 


f 

a 


13  zirconium  compound 

12  Oseen  approximation 

11  Feynman  diagram 

11  video  equipment 

10  oxygen  recombination 

9  decision  element 

9  smoke  trail 

8  Haynes  alloy 

8  terrier  missile 

7  fuel  pump 

7  rocket  project 

6  conversion  table 


F 

N/A 

N/A 

N/A 

A/N 

N/A 

N/N 

N/A 

N/A 

G  29 

F 

F 
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SAMPLE  8 


NASA f    Term Name                   f_ab    CBU

   6      Gulliver program                    N/A
   6      plutonium compound                  N/A
   6      X band                              A/N
   5      hermetical seal                     N/A
   5      Rayleigh member                     N/A
   4      aircraft antenna                    A/N
   4      gas evacuation                      F
   4      period equation                     F
   4      thrust termination                  A/N
   3      celescope project                   N/A
   3      Herzberg band                       N/A
   3      negative conductance                N/N

SAMPLE  9


NASA f    Term Name                   f_ab    CBU

   3      sodium gallate                      A/N
   2      ammonium picrate                    N/N
   2      Delilah project                     N/A
   2      Hill method                         N/A
   2      Multhopp method                     N/A
   2      quadranted meteor                   N/N
   2      success project                     N/A
   1      aircraft accessory                  A/N
   1      Cepheus constellation               N/N
   1      Dyson Theory                        N/A
   1      Mellas region                       N/A
   1      lead acetate                        A/N

SAMPLE  10


NASA f    Term Name                   f_ab    CBU

   1      muscular function                   N/A
   1      Piapacs project                     N/A
   1      scale error                         F
   1      swordfish operation                 N/A
   1      Vintis Theory                       N/A
   0      Cassiopeia constellation            N/N
   0      Ekman layer                         N/A
   0      high volume                         F
   0      MAC project                         N/A
   0      organic laser                       A/N
   0      retargeting missile                 N/A
   0      surgical instrument                 N/A
DISTRIBUTION  OF  C_ab  VALUES  IN  A  SAMPLE  OF  CBU  MASTER  PAIR  LIST*


Every 55th entry (the one at the top of the page) in the alphabetical
interval A through P in this list was sampled. The entries represent
all word pairs ab in GE-2 for which

        f_a ≤ 2040

        f_b ≤ 2040

        C_ab = (f_ab · 450,000) / (f_a · f_b) > 20

Note  that  frequencies  are  not  coalesced. 
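The C_ab criterion can be illustrated with one entry from Appendix A. The sketch below is illustrative only: the names `c_ab`, `exceeds_threshold` and `SCALE` are hypothetical, and the constant 450,000 and the threshold of 20 are taken from the criteria stated above.

```python
SCALE = 450_000  # constant appearing in the C_ab formula above


def c_ab(f_ab, f_a, f_b, scale=SCALE):
    """Pair frequency scaled by the product of the two word frequencies."""
    return f_ab * scale / (f_a * f_b)


def exceeds_threshold(f_ab, f_a, f_b, threshold=20):
    """True when the pair's C_ab value is above the stated threshold."""
    return c_ab(f_ab, f_a, f_b) > threshold


# "thin film" in Appendix A, Sample 1: f_ab = 9, f_a = 221, f_b = 226.
thin_film = c_ab(9, 221, 226)
print(round(thin_film), exceeds_threshold(9, 221, 226))
```

Appendix A lists C_ab = 81 for this pair, which the sketch reproduces after rounding.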

The number of entries with C_ab values in certain ranges was
tallied. The probability that a pair meeting the above criteria will
have a C_ab in the stated range was calculated.


C_ab interval     Interval size     No. Entries     P = No. Entries / 94

20-49                   30               23               .25
50-99                   50               20               .21
100-199                100                9               .094
200-299                100                8               .085
300-499                200                5               .053
500-999                500                6               .064
1000-1999             1000                5               .053
2000-4999             3000                8               .085
5000 +                                   10               .107

Total                                    94


*  Not  previously  issued.  By  Vincent  E.  Giuliano  and  Paul  E.  Jones 
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Probability  that  a  Word  Pair  (in  the  GE  2-A  -  Master  CBU  list)  will  have  Cab  ^  X 


Section  IV 


SUMMARY  OF  DATA  PRINTOUTS  RETAINED 


The list below contains a brief description of the more important
data collected in printout form.

1.  A listing of all GE-1 documents with the set of GE-1A index terms
assigned.

2.  A listing of GE-1A index terms with the set of documents to which they
were assigned.

3.  A listing of the words appearing in the 10,000 abstract GE-2 collection
and their frequencies.

4.  A listing, for each word pair in the GE-2 collection, of the frequency
of the pair and the frequencies of the two words.

5.  A listing of 1, 2, 3, and 4 word strings in frequency order.

6.  A listing, for each frequency, of the number of word types with that
frequency.

7.  An  alphabetic  listing  of  all  3  and  4  word  strings  appearing  3  or 
more  times  with  frequencies  of  constituent  substrings  given. 

8.  An  alphabetic  listing  of  all  word  pairs  designated  as  content  bearing  units. 

9.  A  listing  in  intervals  of  200,  of  the  number  of  abstracts  with  a 
given  length  in  the  interval  and  cumulative,  taken  in  accession 
number  order. 

10.  A  listing  of  association  profiles  for  each  of  the  1000  GE  2  terms  based 
on  various  matrices. 
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